Preface


If you have never studied statistics, you are probably unaware of the impact the sci- 
ence of statistics has on your everyday life. From knowing which medical treatments 
work best to choosing which television programs remain on the air, decision makers 
in almost every line of work rely on data and statistical studies to help them make 
wise choices. Statistics deals with complex situations involving uncertainty. We are 
exposed daily to information from surveys and scientific studies concerning our 
health, behavior, attitudes, and beliefs, or revealing scientific and technological 
breakthroughs. This book’s first objective is to help you understand this information 
and to sift the useful and the accurate from the useless and the misleading. My aims 
are to allow you to rely on your own interpretation of results emerging from surveys 
and studies and to help you read them with a critical eye so that you can make your 
own judgments. 

A second purpose of this book is to demystify statistical methods. Traditional 
statistics courses often place emphasis on how to compute rather than on how to un- 
derstand. This book focuses on statistical ideas and their use in real life. 

Finally, the book contains information that can help you make better decisions 
when faced with uncertainty. You will learn how psychological influences can keep 
you from making the best decisions, as well as new ways to think about coinci- 
dences, gambling, and other circumstances that involve chance events. 


Philosophical Approach 


If you are like most readers of this book, you will never have to produce statistical 
results in your professional life, and, if you do, a single statistics book or course 
would be inadequate preparation anyway. But certainly in your personal life and 
possibly in your professional life, you will have to consume statistical results pro- 
duced by others. Therefore, the focus of this book is on understanding the use of sta- 
tistical methods in the real world rather than on producing statistical results. There 
are dozens of real-life, in-depth case studies drawn from various media sources as 
well as scores of additional real-life examples. The emphasis is on understanding 
rather than computing, but the book also contains examples of how to compute im- 
portant numbers when necessary, especially when the computation is useful for 
understanding. 


Although this book is written as a textbook, it is also intended to be readable 
without the guidance of an instructor. Each concept or method is explained in plain 
language and is supported with numerous examples. 


Organization 


There are 27 chapters divided into four parts. Each chapter covers material more or 
less equivalent to a one-hour college lecture. The final chapters of Part 1 and Part 4 
consist solely of case studies and are designed to illustrate the thought process you 
should follow when you read studies on your own. 

By the end of Part 1, “Finding Data in Life,” you will have the tools to determine
whether or not the results of a study should be taken seriously; you will be able to 
detect false conclusions and biased results. In Part 2, “Finding Life in Data,” you will 
learn how to turn numbers into useful information and to quantify relationships be- 
tween such factors as aspirin consumption and heart attack rates or meditation and 
aging. You will also learn how to detect misleading graphs and figures and to inter- 
pret common economic statistics. 

Part 3 is called “Understanding Uncertainty in Life” and is designed to help you 
do exactly that. Every day we have to make decisions in the face of uncertainty. This 
part of the book will help you understand what probability and chance are all about 
and presents techniques that can help you make better decisions. The material on 
probability will also be useful when you read Part 4, “Making Judgments from Sur- 
veys and Experiments.” Some of the chapters in Part 4 are slightly more technical 
than the rest of the book, but once you have mastered them you will truly understand 
the beauty of statistical methods. Henceforth, when you read the results of a statisti- 
cal study, you will be able to tell whether the results represent valuable advice or 
flawed reasoning. Unless things have changed drastically by the time you read this, 
you will be amazed at the number of news reports that exhibit flawed reasoning. 


Thought Questions: Using Your Common Sense 


All of the chapters, except the one on ethics and those that consist solely of case 
studies, begin with a series of Thought Questions that are designed to be answered 
before you read the chapter. 

Most of the answers are based on common sense, perhaps combined with 
knowledge from previous chapters. Answering them before reading the chapter will 
reinforce the idea that most information in this book is based on common sense. 
You will find answers to the thought questions—or to similar questions—embed- 
ded in the chapter. 

In the classroom, the thought questions can be used for discussion at the begin- 
ning of each class. For relatively small classes, groups of students can be assigned 
to discuss one question each, then to report back to the class. If you are taking a class 
in which one of these formats is used, try to answer the questions on your own be- 
fore class. By doing so, you will build confidence as you learn that the material is 
not difficult to understand if you give it some thought. 


Case Studies and Examples: Collect Your Own


The book is filled with real-life Case Studies and Examples covering a wide range 
of disciplines. These studies and examples are intended to appeal to a broad audi- 
ence. In the rare instance where technical subject-matter knowledge is required, it is 
given with the example. Sometimes, the conclusion presented in the book will be dif- 
ferent from the one given in the original news report. This happens because many 
news reports misinterpret statistical results. 

I hope you find the case studies and examples interesting and informative; how- 
ever, you will learn the most by examining current examples on topics of interest to 
you. Follow any newspaper, news magazine, or Internet news site for a while and you
are sure to find plenty of illustrations of the use of surveys and studies. If you start 
collecting them now, you can watch your understanding increase as you work your 
way through this book. 


Formulas: It’s Your Choice 


If you dread mathematical formulas, you should find this book comfortably read- 
able. In most cases where computations are required, they are presented step by step 
rather than in a formula. The steps are accompanied by worked examples so that you 
can see exactly how to carry them out. 

On the other hand, if you prefer to work with formulas, each relevant chapter 
ends with a section called For Those Who Like Formulas. The section includes all 
the mathematical notation and formulas pertaining to the material in that chapter. 


Exercises and Mini-Projects 


Numerous Exercises appear at the end of each chapter. Many of them are similar to 
the Thought Questions and require an explanation for which there is no one correct 
answer. Answers to some of those with concise solutions are provided at the back of 
the book. These are indicated with an asterisk next to the exercise number. Teaching 
Seeing Through Statistics: An Instructor’s Resource Manual, which is available to 
instructors, explains what is expected for each exercise. 

In most chapters, the exercises contain many real-life examples. However, with 
the idea that you learn best by doing, most chapters also contain Mini-Projects. 
Some of these ask you to find examples of studies of interest to you; others ask you 
to conduct your own small-scale study. If you are reading this book without the ben- 
efit of a class or instructor, I encourage you to try some of the projects on your own. 


Covering the Book in a Quarter, in a Semester,
or on Your Own


I wrote this book for a one-quarter course taught three times a week at the University
of California at Davis as part of the general education curriculum. My aim was
to allow one lecture for each chapter, thus allowing for completion of the book (and
a midterm or two) in the usual 29- or 30-lecture quarter. When I teach the course, I
do not cover every detail from each chapter; I expect students to read some material 
on their own. 

If the book is used for a semester course, it can be covered at a more leisurely 
pace and in more depth. For instance, two classes a week can be used for covering 
new material and a third class for discussion, additional examples, or laboratory 
work. Alternatively, with three regular lectures a week, some chapters can be cov- 
ered in two sessions instead of one. 

Instructors can obtain a copy of Teaching Seeing Through Statistics: An Instruc- 
tor’s Resource Manual, which contains additional information on how to cover the 
material in one quarter or semester. The manual also includes tips on teaching this 
material, ideas on how to cover each chapter, sample lectures, additional examples, 
and exercise solutions. 

Instructors who want to focus on more in-depth coverage of specific topics may 
wish to exclude others. Certain chapters can be omitted without interrupting the flow 
of the material or causing serious consequences in later chapters. These include 
Chapter 9 on plots and graphs, Chapters 14 and 15 on economic data and time series 
(but Chapter 15 relies on Chapter 14), Chapters 17 and 18 on psychological and in- 
tuitive misunderstandings of probability, Chapter 25 on meta-analysis, and Chapter 
26 on ethics. 

If you are reading this book on your own, you may want to concentrate on se- 
lected topics only. Parts 1 and 3 can be read alone, as can Chapters 9 and 14. Part 4 
relies most heavily on Chapters 8, 12, 13, and 16. Although Part 4 is the most tech- 
nically challenging part of the book, I strongly recommend reading it because it is 
there that you will truly learn the beauty as well as the pitfalls of statistical reasoning. 

If you get stuck, try to step back and reclaim the big picture. Remember that al- 
though statistical methods are very powerful and are subject to abuse, they were de- 
veloped using the collective common sense of researchers whose goal was to figure 
out how to find and interpret information to understand the world. They have done 
the hard work; this book is intended to help you make sense of it all. 


Changes from the First to the Second Edition 


A book like this one is probably only as interesting as the examples and stories it re- 
lates, so for the second edition, numerous fresh examples and case studies were 
added. Over 100 new exercises were also added, many based on news stories. In the 
short time between the first and second editions, Internet use skyrocketed, and so the 
second edition included many examples from and references to Web sites with in- 
teresting data. 

The most substantial structural change from the first to the second edition was in 
Part 3. Using feedback from instructors, Chapters 15 and 16 from the first edition 
were combined and altered to make the material more relevant to daily life. Some of 
that material was moved to the subsequent two chapters (Chapters 16 and 17 in the 
second edition). Box plots were added to Chapter 7, and Chapter 13 was rewritten to 
reflect changes in the Consumer Price Index. Wording and data were updated 
throughout the book as needed. 


New for the Third Edition


There are four major changes from the second edition. First, an Appendix has been 
added containing 20 news stories, which are used in new examples and exercises 
throughout the book. These are tied to full journal articles, most of which are on a 
CD accompanying the book. The CD (which is the second major change) contains 
interactive applets as well. Third, some material has been reorganized and expanded, 
and a new chapter on Ethics has been added. Finally, new exercises and mini- 
projects have been added, most of which take advantage of the news stories and 
journal articles in the Appendix and on the CD. Additional details for these changes 
follow. 


An Appendix of News Stories and a CD have been added: One of the goals of this 
book is to help you understand news stories based on statistical studies. To enhance 
that goal, 20 news stories are provided in the Appendix. But the news stories tell only 
part of the story. When journalists write such stories, they rely on original sources, 
which in most cases include an article in a technical journal, or a technical report 
prepared by researchers. To give you that same exposure, the CD accompanying the 
book contains the full text of the original sources for most of the news stories. Be- 
cause these articles include hundreds of pages, it would not have been possible to ap- 
pend the printed versions of them. Having immediate access to these reports allows 
you to learn much more about how the research was conducted, what statistical 
methods were used, and what conclusions the original researchers drew. You can then
compare these to the news stories derived from them, and determine whether you 
think those stories are accurate and complete. In some cases an additional news story 
or press release is included on the CD as well. 

The CD also includes computer applets that will allow you to explore some of the 
concepts in this book in an interactive way. Your book should have been accompanied 
by an Activities Manual that includes suggestions for how to explore these applets. 


New chapters and sections have been added: In response to feedback from users, 
Chapter 12 from the second edition has been expanded and divided into Chapters 12 
and 13. Chapter 13 now includes a more complete introduction to hypothesis test- 
ing, which will help prepare you for Part 4 of the book. As a consequence, all of the 
remaining chapters are renumbered. 

There is also a new chapter, Chapter 26, called “Ethics in Statistical Studies.” As 
you have probably heard, some people think that you can use statistics to prove (or 
disprove) anything. That’s not quite true, but it is true that there are multiple ways 
that researchers can naively or intentionally bias the results of their studies. Ethical 
researchers have a responsibility to make sure that doesn’t happen. As an educated 
consumer, you have a responsibility to ask the right questions to determine if some- 
thing unethical has occurred. Chapter 26 illustrates some subtle (and not so subtle) 
ways in which ethics play a role in research. 

New sections have been added to Chapters 2, 5, 7, 12, and 22 (formerly Chapter
21). New examples and case studies have been added in various chapters. 


New exercises have been added: With the recognition that some students may be us- 
ing the previous edition, I have tried to leave the numbering of the exercises within 
each chapter consistent with the numbering from the previous edition. Many new exercises
have been added, but in most cases they were added to the end of the existing
exercises so as not to change those numbers. Many of the new exercises refer to
the news stories in the Appendix and the original reports on the CD.


Web Site for Seeing Through Statistics 


The Duxbury Resource Center for Statistical Literacy has been established for users 
of this book; the URL is http://statistics.duxbury.com/utts3e. This site includes a 
variety of resources and information of interest to users of this book. 


Finding Data in Life


By the time you finish reading Part 1 of this book, you will be reading studies
reported in the newspaper with a whole new perspective. In these chapters, 
you will learn how researchers should go about collecting information for sur- 
veys and experiments. You will learn to ask questions, such as who funded the 
research, that could be important in deciding whether the results are accurate 
and unbiased. 

Chapter 1 is designed to give you some appreciation for how statistics helps 
to answer interesting questions. Chapters 2 to 5 provide an in-depth, behind- 
the-scenes look at how surveys and experiments are supposed to be done. In 
Chapter 6, you will learn how to tie together the information from the previ- 
ous chapters, including seven steps to follow when reading about studies. 

These steps all lead to the final step, which is the one you should care about 
the most. You will have learned how to determine whether the results of a 
study are meaningful enough to encourage you to change your lifestyle, attitudes,
or beliefs.


The Benefits and Risks
of Using Statistics


Thought Questions 


1. A recent newspaper article concluded that smoking marijuana at least three times a 
week resulted in lower grades in college. How do you think the researchers came to 
this conclusion? Do you believe it? Is there a more reasonable conclusion? 


2. It is obvious to most people that, on average, men are taller than women, and yet 
there are some women who are taller than some men. Therefore, if you wanted to 
“prove” that men were taller, you would need to measure many people of each sex. 
Here is a theory: On average, men have lower resting pulse rates than women do. 
How could you go about trying to prove or disprove that? Would it be sufficient to 
measure the pulse rates of one member of each sex? Two members of each sex? 
What information about men’s and women’s pulse rates would help you decide how 
many people to measure? 


3. Suppose you were to learn that the large state university in a particular state gradu- 
ated more students who eventually went on to become millionaires than any of the 
small liberal arts colleges in the state. Would that be a fair comparison? How should 
the numbers be presented in order to make it a fair comparison? 


4. In its March 3-5, 1995 issue, USA Weekend magazine asked readers to return a sur- 
vey with a variety of questions about sex and violence on television. Of the 65,142 
readers who responded, 97% were “very or somewhat concerned about violence on
TV” (USA Weekend, 2-4 June 1995, p. 5). Based on this survey, can you conclude
that about 97% of U.S. citizens are concerned about violence on TV? Why or 
why not? 


1.1 Statistics


When you hear the word statistics, you probably either get an attack of math anxi-
ety or think about lifeless numbers, such as the population of the city or town where
you live, as measured by the latest census, or the per capita income in Japan. The
goal of this book is to open a whole new world of understanding of the term statis-
tics. By the time you finish reading this book, you will realize that the invention of
statistical methods is one of the most important developments of modern times.
These methods influence everything from life-saving medical advances to which
television shows remain on the air.

The word statistics is actually used to mean two different things. The better-
known definition is that statistics are numbers measured for some purpose. A more
appropriate, complete definition is the following:


Statistics is a collection of procedures and principles for gaining and
analyzing information in order to help people make decisions when faced with
uncertainty.


Using this definition, you have undoubtedly used statistics in your own life. For ex-
ample, if you were faced with a choice of routes to get to school or work, or to get
between one classroom building and the next, how would you decide which one to
take? You would probably try each of them a number of times (thus gaining infor-
mation) and then choose the best one according to some criterion important to you,
such as speed, fewer red lights, more interesting scenery, and so on. You might even
use different criteria on different days—such as when the weather is pleasant versus
when it is not. In any case, by sampling the various routes and comparing them, you
would have gained and analyzed useful information to help you make a decision.

In this book, you will learn ways to intelligently improve your own methods for
collecting and analyzing complex information. You will learn how to interpret infor-
mation that others have collected and analyzed and how to make decisions when
faced with uncertainty. In Case Study 1.1, we will see how one researcher followed
a casual observation to a fascinating conclusion.


CASE STUDY 1.1

Heart or Hypothalamus?
Source: Salk (1973), pp. 26-29.


You can learn a lot about nature by observation. You can learn even more by con- 
ducting a carefully controlled experiment. This case study has both. It all began 
when psychologist Lee Salk noticed that despite his knowledge that the hypothala- 
mus plays an important role in emotion, it was the heart that seemed to occupy the 
thoughts of poets and songwriters. There were no everyday expressions or song ti- 
tles such as “I love you from the bottom of my hypothalamus” or “My hypothala- 
mus longs for you.” Yet, there was no physiological reason for suspecting that the 
heart should be the center of such attention. Why had it always been the designated 
choice? 

Salk began wondering about the role of the heart in human relationships. He also
noticed that when on 42 separate occasions he watched a rhesus monkey at the zoo 
holding her baby, she held the baby on the left side, close to her heart, on 40 of those 
occasions. He then observed 287 human mothers within 4 days after giving birth and 
noticed that 237, or 83%, held their babies on the left. Handedness did not explain 
it; 83% of the right-handed mothers and 78% of the left-handed mothers exhibited 
the left-side preference. When asked why they chose the left side, the right-handed 
mothers said it was so their right hand would be free. The left-handed mothers said 
it was because they could hold the baby better with their dominant hand. In other 
words, both groups were able to rationalize holding the baby on the left based on 
their own preferred hand. 

Salk wondered if the left side would be favored when carrying something other 
than a newborn baby. He found a study in which shoppers were observed leaving a 
supermarket carrying a single bag; exactly half of the 438 adults carried the bag on 
the left. But when stress was involved, the results were different. Patients at a den- 
tist’s office were asked to hold a 5-inch rubber ball while the dentist worked on their 
teeth. Substantially more than half held the ball on the left. 

Salk speculated, “It is not in the nature of nature to provide living organisms with 
biological tendencies unless such tendencies have survival value.” He surmised that 
there must indeed be survival value to having a newborn infant placed close to the 
sound of its mother’s heartbeat. 

To test this conjecture, Salk designed a study in a baby nursery at a New York 
City hospital. He arranged for the nursery to have the continuous sound of a human 
heartbeat played over a loudspeaker. At the end of 4 days, he measured how much 
weight the babies had gained or lost. Later, with a new group of babies in the nurs- 
ery, no sound was played. Weight gains were again measured after 4 days. 

The results confirmed what Salk suspected. Although they did not eat more than 
the control group, the infants treated to the sound of the heartbeat gained more 
weight (or lost less). Further, they spent much less time crying. Salk’s conclusion 
was that “newborn infants are soothed by the sound of the normal adult heartbeat.” 
Somehow, mothers intuitively know that it is important to hold their babies on the 
left side. What had started as a simple observation of nature led to a further under- 
standing of an important biological response of a mother to her newborn infant. 


1.2 Detecting Patterns and Relationships 


Some differences are obvious to the naked eye, such as the fact that the average man 
is taller than the average woman. If we were content to know about only such obvi- 
ous relationships, we would not need the power of statistical methods. But had you 
noticed that babies who listen to the sound of a heartbeat gain more weight? Have 
you ever noticed that taking aspirin helps prevent heart attacks? How about the fact 
that people are more likely to buy blue jeans in certain months of the year than in 
others? The fact that men have lower resting pulse rates than women do? The fact 
that listening to Mozart improves performance on the spatial reasoning questions of 
an IQ test? All of these are relationships that have been demonstrated in studies using
proper statistical methods, yet none of them are obvious to the naked eye.

Let’s take the simplest of these examples—one you can test yourself—and see 
what’s needed to properly demonstrate the relationship. Suppose you wanted to ver- 
ify the claim that, on average, men have lower resting pulse rates than women do. 
Would it be sufficient to measure only your own pulse rate and that of a friend of the 
opposite sex? Obviously not. Even if the pair came out in the predicted direction, the 
singular measurements would certainly not speak for all members of each sex. 

It is not easy to conduct a statistical study properly, but it is easy to understand 
much of how it should be done. We will examine each of the following concepts in 
great detail in the remainder of this book; here we just introduce them, using the sim- 
ple example of comparing male and female pulse rates. 


To conduct a statistical study properly, one must 


1. Get a representative sample.
2. Get a large enough sample.
3. Decide whether the study should be an observational study or a randomized
   experiment.


1. Get a representative sample. Most researchers hope to extend their results be- 
yond just the participants in their research. Therefore, it is important that the people 
or objects in a study be representative of the larger group for which conclusions are 
to be drawn. We call those who are actually studied a sample and the larger group 
from which they were chosen a population. (In Chapter 4 we will learn some ways 
to select a proper sample.) For comparing pulse rates, it may be convenient to use 
the members of your class. But this sample would not be valid if there were some- 
thing about your class that would relate pulse rates and sex, such as if the entire 
men’s track team happened to be in the class. It would also be unacceptable if you 
wanted to extend your results to an age group much different from the distribution 
of ages in your class. Often researchers are constrained to using such “convenience” 
samples, and we will discuss the implications of this later in the book. 


2. Get a large enough sample. Even experienced researchers often fail to recognize 
the importance of this concept. In Part 4 of this book, you will learn how to detect the 
problem of a sample that is too small; you will also learn that such a sample can 
sometimes lead to erroneous conclusions. In comparing pulse rates, collecting one 
pulse rate from each sex obviously does not tell us much. Is two enough? Four? One 
hundred? The answer to that question depends on how much variability there is 
among pulse rates. If all men had pulse rates of 65 and all women had pulse rates of 
75, it wouldn’t take long before you recognized a difference. However, if men’s pulse 
rates ranged from 50 to 80 and women’s pulse rates ranged from 52 to 82, it would 
take many more measurements to convince you of a difference. The question of how 
large is “large enough” is closely tied to how diverse the measurements are likely to
be within each group. The more diverse, or variable, the individuals within each 
group, the larger the sample needs to be to detect a real difference between the groups. 
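
If you would like to see this idea in action, here is a minimal simulation sketch in Python. It rests on assumptions that are not part of the text: pulse rates are treated as normally distributed, the true averages are set to 65 for men and 67 for women, and the standard deviation of 8 is chosen only to produce a spread roughly like the 50-to-80 and 52-to-82 ranges mentioned above. The function name and its parameters are hypothetical.

```python
import random
import statistics


def fraction_in_expected_direction(n, sd, mean_men=65.0, mean_women=67.0,
                                   trials=2000, seed=1):
    """Simulate `trials` studies, each measuring n men and n women.

    Returns the fraction of simulated studies in which the men's
    average pulse rate came out lower than the women's average.
    All default values are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    hits = 0
    for _ in range(trials):
        men = [rng.gauss(mean_men, sd) for _ in range(n)]
        women = [rng.gauss(mean_women, sd) for _ in range(n)]
        if statistics.fmean(men) < statistics.fmean(women):
            hits += 1
    return hits / trials


# Compare very small samples with a larger one, keeping the variability fixed.
for n in (1, 4, 100):
    frac = fraction_in_expected_direction(n=n, sd=8.0)
    print(f"n = {n:3d} per group: {frac:.0%} of simulated studies show men lower")
```

With one person per group, the simulated studies point in the right direction only a little more often than a coin flip would. With 100 per group, nearly all of them do. If you set `sd=0.0`, so that every man has a pulse rate of 65 and every woman 67, a single pair is always enough, which mirrors the 65-versus-75 illustration in the text.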


3. Decide whether the study should be an observational study or a randomized 
experiment. For comparing pulse rates, it would be sufficient to measure or “observe”
both the pulse rate and the sex of the people in our sample. When we merely
observe things about our sample, we are conducting an observational study. How- 
ever, if we were interested in whether frequent use of aspirin would help prevent 
heart attacks (which has been suggested as a likely possibility), it would not be suf- 
ficient to simply observe whether people frequently took aspirin and then whether 
they had a heart attack. It could be that people who were more concerned with their 
health were both more likely to take aspirin and less likely to have a heart attack, or 
vice versa. Or, it could be that drinking the extra glass of water required to take the 
aspirin contributes to better health. 


To be able to make a causal connection, we would have to conduct a random- 
ized experiment in which we randomly assigned people to one of two groups. Ran- 
dom assignments are made by doing something akin to flipping a coin to determine 
the group membership for each person. In one group, people would be given aspirin, 
and in the other, they would be given a dummy pill that looked like aspirin. So as not 
to influence people with our expectations, we would not tell people which one they 
were taking until the experiment was concluded. In Case Study 1.2, we briefly ex- 
amine such an experiment; in Chapter 5 we discuss these ideas in much more detail. 
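
The coin-flipping idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the participant labels are made up, the function name `randomly_assign` is hypothetical, and a real experiment would use a more carefully documented procedure.

```python
import random


def randomly_assign(participants, seed=None):
    """Assign each participant to 'aspirin' or 'placebo' by a simulated coin flip."""
    rng = random.Random(seed)
    assignment = {}
    for person in participants:
        # rng.random() < 0.5 plays the role of a fair coin landing heads.
        assignment[person] = "aspirin" if rng.random() < 0.5 else "placebo"
    return assignment


# Hypothetical participant IDs, used only for illustration.
participants = [f"P{i:03d}" for i in range(1, 21)]
groups = randomly_assign(participants, seed=42)

for treatment in ("aspirin", "placebo"):
    members = [p for p, g in groups.items() if g == treatment]
    print(f"{treatment}: {len(members)} people")
```

Because each person's group is decided by chance alone, neither the researcher nor the participant can steer health-conscious people into one group. Separate coin flips will usually produce groups of slightly different sizes, which does no harm when the groups are large.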


CASE STUDY 1.2

Does Aspirin Prevent Heart Attacks?


In 1988, the Steering Committee of the Physicians’ Health Study Research Group re- 
leased the results of a 5-year randomized experiment conducted using 22,071 male 
physicians between the ages of 40 and 84. The physicians had been randomly as- 
signed to two groups. One group took an ordinary aspirin tablet every other day, 
whereas the other group took a “placebo,” a pill designed to look just like an aspirin 
but with no active ingredients. Neither group knew whether they were taking the ac- 
tive ingredient. 

The results, shown in Table 1.1, support the conclusion that taking aspirin does 
indeed help reduce the risk of having a heart attack. The rate of heart attacks in the 
group taking aspirin was only 55% of the rate of heart attacks in the placebo group, 
or just slightly more than half as big. Because the men were randomly assigned to 
the two conditions, other factors, such as amount of exercise, should have been sim- 
ilar for both groups. The only substantial difference in the two groups should have 
been whether they took the aspirin or the placebo. Therefore, we can conclude that 
taking aspirin caused the lower rate of heart attacks for that group. 

Notice that because the participants were all male physicians, these conclusions 
may not apply to the general population of men. They may not apply to women at 
all because no women were included in the study. More recent evidence has pro- 
vided even more support for this effect, however, something we will examine in 
more detail in an example in Chapter 27. 


Table 1.1 The Effect of Aspirin on Heart Attacks

Condition   Heart Attack   No Heart Attack   Attacks per 1000
Aspirin              104            10,933               9.42
Placebo              189            10,845              17.13
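
For readers who want to check the arithmetic behind Table 1.1, the short Python sketch below recomputes the last column and the 55% figure quoted in the case study. The counts come directly from the table; the variable and function names are just illustrative choices.

```python
# Counts taken from Table 1.1.
table = {
    "Aspirin": {"attack": 104, "no_attack": 10_933},
    "Placebo": {"attack": 189, "no_attack": 10_845},
}


def attacks_per_1000(counts):
    """Heart attacks per 1000 participants in one group."""
    total = counts["attack"] + counts["no_attack"]
    return 1000 * counts["attack"] / total


rates = {group: attacks_per_1000(c) for group, c in table.items()}

# How large the aspirin group's rate is relative to the placebo group's rate.
ratio = rates["Aspirin"] / rates["Placebo"]

for group, rate in rates.items():
    print(f"{group}: {rate:.2f} heart attacks per 1000")
print(f"Aspirin rate as a share of the placebo rate: {ratio:.0%}")
```

The rates come out to 9.42 and 17.13 per 1000, matching the table, and their ratio is about 55%. Note that each rate divides by the size of its own group, so the comparison stays fair even though the two groups are not exactly the same size.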


1.3 Don’t Be Deceived by Improper Use of Statistics 


Let’s look at some examples representative of the kinds of abuses of statistics you
may see in the media. In the first example, the simple principles we have been
discussing were violated; in the second example, the statistics have been taken out of
their proper context; and in the third and fourth examples, you will see how to stop
short of making too strong a conclusion on the basis of an observational study.


EXAMPLE 1

In 1986, a business-oriented magazine published in Washington, D.C., conducted a
survey that concluded that Chrysler president Lee Iacocca would beat Vice President
George Bush in a Republican primary by a margin of 54% to 47% [sic]. Further reading
revealed that the poll was based on questionnaires mailed to 2000 of the magazine's
readers, surely a biased sample of American voters. To make matters worse, the results
were compiled from only the first 200 respondents. It should not surprise you to learn
that those who feel strongly about an issue, especially those who would like to see a
change, are most likely to respond to a survey received in the mail. Therefore, the
“sample” was not at all representative of the “population” of all people likely to vote in
a Republican primary election. (In the next election year, 1988, George Bush not only
won the Republican primary, but went on to win the presidential election with 54% of
the popular vote.)


EXAMPLE 2

When a federal air report ranked the state of New Jersey as 22nd in the nation in its
release of toxic chemicals, the New Jersey Department of Environmental Protection
happily took credit (Wang, 1993, p. 170). The statistic was based on a reliable source, a
study by the U.S. Environmental Protection Agency. However, the ranking had been
made based on total pounds released, which was 38.6 million for New Jersey. When this
total was turned into pounds per square mile in the state, it became apparent New
Jersey was one of the worst: fourth on the list. Because New Jersey is one of the
smallest states by area, the figures were quite misleading until adjusted for size.
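
To see how a ranking can reverse once totals are divided by area, consider the following sketch in Python. Only New Jersey's total of 38.6 million pounds comes from the example. Its area (roughly 8,700 square miles) is an approximate figure supplied here as an assumption, and the two comparison states are entirely hypothetical.

```python
# Each entry: (total pounds released, area in square miles).
# Only New Jersey's total comes from the example. Its area is an
# approximate figure assumed here, and the other two rows are invented.
releases = {
    "New Jersey": (38_600_000, 8_700),
    "Large state (hypothetical)": (120_000_000, 270_000),
    "Midsize state (hypothetical)": (20_000_000, 100_000),
}


def rank(scores):
    """Return the names ordered from highest score to lowest."""
    return sorted(scores, key=scores.get, reverse=True)


totals = {name: pounds for name, (pounds, _) in releases.items()}
per_sq_mile = {name: pounds / area for name, (pounds, area) in releases.items()}

print("Ranked by total pounds:          ", rank(totals))
print("Ranked by pounds per square mile:", rank(per_sq_mile))
```

In this invented comparison the large state releases the most in total, but New Jersey moves to the top once each total is divided by area, which is the same kind of reversal described in the example.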


EXAMPLE 3

Read the article in Figure 1.1, and then read the headline again. Notice that the
headline stops short of making a causal connection between smoking during
pregnancy and lower IQs in children. Reading the article, you can see that the results
are based on an observational study and not an experiment, and with good reason: It
would clearly be unethical to randomly assign pregnant women to either smoke or not.
With studies like this, the best that can be done is to try to measure and statistically

Figure 1.1 Don't make causal connections from observational studies

Source: “Study: Smoking May Lower Kids’ IQs.” Associated Press, February 11, 1994.
Reprinted with permission.

Study: Smoking May Lower Kids’ IQs


ROCHESTER, N.Y. (AP) - Secondhand smoke has little impact on the intelligence
scores of young children, researchers found.

But women who light up while pregnant could be dooming their babies to
lower IQs, according to a study released Thursday.

Children ages 3 and 4 whose mothers smoked 10 or more cigarettes a day during
pregnancy scored about 9 points lower on the intelligence tests than the
offspring of nonsmokers, researchers at Cornell University and the University of
Rochester reported in this month’s Pediatrics journal.

That gap narrowed to 4 points against children of nonsmokers when a wide range
of interrelated factors were controlled. The study took into account secondhand smoke
as well as diet, education, age, drug use, parents’ IQ, quality of parental care and
duration of breast feeding.

“It is comparable to the effects that moderate levels of lead exposure have on
children’s IQ scores,” said Charles Henderson, senior research associate at Cornell’s
College of Human Ecology in Ithaca.

adjust for other factors that might be related to both smoking behavior and children’s
IQ scores. Notice that when the researchers did so, the gap in IQ between the children
of smokers and nonsmokers narrowed from 9 points down to 4 points. There may be
even more factors that the researchers did not measure that would account for the
remaining 4-point difference. Unfortunately, with an observational study, we simply
cannot make causal conclusions. We will explore this particular example in more detail
in Chapter 6.
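
The way a gap can shrink after adjustment is easier to see with numbers. The sketch below, in Python, uses invented counts and averages, chosen only so that the unadjusted and adjusted gaps mirror the 9 and 4 points in the article; they are not the study's data. It adjusts for a single background factor (labeled here as parents' education) by comparing children within each level of that factor. The actual study accounted for many factors, and the article does not describe its method.

```python
# Hypothetical counts and mean IQ scores, invented for illustration.
# Children are split by one background factor that is related both to
# smoking during pregnancy and to IQ scores.
# Each entry: (number of children, mean IQ)
data = {
    "lower education": {"smokers": (75, 94.0), "nonsmokers": (25, 98.0)},
    "higher education": {"smokers": (25, 104.0), "nonsmokers": (75, 108.0)},
}


def overall_mean(group):
    """Mean IQ for one group, pooling both education levels."""
    n = sum(data[level][group][0] for level in data)
    total = sum(data[level][group][0] * data[level][group][1] for level in data)
    return total / n


# Unadjusted comparison: ignore the background factor entirely.
crude_gap = overall_mean("nonsmokers") - overall_mean("smokers")

# Adjusted comparison: compare within each education level, then average
# the within-level gaps, weighting each level by its number of children.
weights = {level: sum(n for n, _ in data[level].values()) for level in data}
gaps = {level: data[level]["nonsmokers"][1] - data[level]["smokers"][1] for level in data}
adjusted_gap = sum(weights[lv] * gaps[lv] for lv in data) / sum(weights.values())

print(f"Unadjusted gap: {crude_gap:.1f} points")
print(f"Adjusted gap:   {adjusted_gap:.1f} points")
```

In this invented data set, part of the unadjusted gap arises simply because smokers are concentrated in the level with lower average scores. Any factor that was not measured cannot be adjusted for in this way, which is why the remaining gap still cannot be read as a causal effect.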


EXAMPLE 4

An article headlined “New study confirms too much pot impairs brain” read as follows:


More evidence that chronic marijuana smoking impairs mental ability: Researchers
at the University of Iowa College of Medicine say a test shows those who smoke
seven or more marijuana joints per week had lower math, verbal and memory
scores than non-marijuana users. Scores were particularly reduced when marijuana
users held a joint’s smoke in their lungs for longer periods. (San Francisco
Examiner, 13 March 1993, p. D-1)


This research was clearly based on an observational study because people cannot be
randomly assigned to either smoke marijuana or not. The headline is misleading because it
implies that there is a causal connection between smoking marijuana and brain
functioning. All we can conclude from an observational study is that there is a relationship.
It could be the case that people who choose to smoke marijuana are those who would
score lower on the tests anyway.
CASE STUDY 1.3


A Mistaken Accusation of Cheating 


Klein (1992) described a situation in which two students were accused of cheating 
on a multiple-choice medical licensing exam. They had been observed whispering 
during one part of the 3-day exam and their answers to the questions they got wrong 
very often matched each other. The licensing board determined that the statistical 
evidence for cheating was overwhelming. They estimated that the odds of two peo- 
ple having answers as close as these two did were less than 1 in 10,000. Further, the 
students were husband and wife. Their tests were invalidated. 

The case went to trial, and upon further investigation the couple was exonerated. 
They hired a statistician who was able to show that the agreement in their answers 
during the session in which they were whispering was no higher than it was in the 
other sessions. What happened? The board assumed students who picked the wrong 
answer were simply guessing among the other choices. This couple had grown up 
together and had been educated together in India. Answers that would have been cor- 
rect for their culture and training were incorrect for the American culture (for exam- 
ple, whether a set of symptoms was more indicative of tuberculosis or a common 
cold). Their common mistakes often would have been the right answers for India. So, 
the licensing board erred in calculating the odds of getting such a close match by us- 
ing the assumption that they were just guessing. And, according to Klein, “with re- 
gard to their whispering, it was very brief and had to do with the status of their sick 
child” (p. 26). 
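
The board's error can be illustrated with a small calculation. The sketch below, in Python, compares the chance of a close match under two assumptions: wrong answers chosen by blind guessing, and wrong answers shaped by shared training. Every number in it (20 questions, 15 matches, four wrong options per question, a 0.8 chance of choosing the favored wrong option) is hypothetical and is not taken from Klein's account.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) when X counts matches over n independent questions,
    each matching with probability p (a binomial tail probability)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# All numbers below are hypothetical, chosen only for illustration.
n_questions = 20      # questions both examinees answered incorrectly
n_matches = 15        # of those, how many wrong answers were identical
n_wrong_choices = 4   # wrong options available on each question

# Assumption A (the board's): each wrong answer is a blind guess, so
# two examinees choose the same wrong option with probability 1/4.
p_guess = 1 / n_wrong_choices

# Assumption B: shared training makes one wrong option look correct.
# Each examinee picks it with probability 0.8 and otherwise chooses
# uniformly among the remaining three wrong options.
p_favored = 0.8
p_other = (1 - p_favored) / (n_wrong_choices - 1)
p_shared = p_favored**2 + (n_wrong_choices - 1) * p_other**2

tail_guess = prob_at_least(n_matches, n_questions, p_guess)
tail_shared = prob_at_least(n_matches, n_questions, p_shared)

print(f"Chance of 15 or more matches, blind guessing:  {tail_guess:.6f}")
print(f"Chance of 15 or more matches, shared training: {tail_shared:.3f}")
```

Under blind guessing such a close match looks nearly impossible, but under the shared-training assumption it is unremarkable. The calculation is only as good as the assumption behind it.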


1.4 Summary and Conclusions 


In this chapter, we have just begun to examine both the advantages and the dangers 
of using statistical methods. We have seen that it is not enough to know the results of 
a study, survey, or experiment. We also need to know how those numbers were col- 
lected and who was asked. In the upcoming chapters, you will learn much more about 
how to collect and process this kind of information properly and how to detect prob- 
lems in what others have done. You will learn that a relationship between two char- 
acteristics (such as smoking marijuana and lower grades) does not necessarily mean 
that one causes the other, and you will learn how to determine other plausible expla- 
nations. In short, you will become an educated consumer of statistical information. 


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. Explain why the relationship shown in Table 1.1, concerning the use of aspirin 
and heart attack rates, can be used as evidence that aspirin actually prevents 
heart attacks. 


2. “People who often attend cultural activities, such as movies, sports events and
concerts, are more likely than their less cultured cousins to survive the next eight
to nine years, even when education and income are taken into account, according
to a survey by the University of Umea in Sweden” (American Health, April
1997, p. 20).

a. Can this claim be tested by conducting a randomized experiment? Explain.

b. On the basis of the study that was conducted, can we conclude that attending
cultural events causes people to be likely to live longer? Explain.

c. The article continued “No one’s sure how Mel Gibson and Mozart help
health, but the activities may enhance immunity or coping skills.” Comment
on the validity of this statement.

d. The article notes that education and income were taken into account. Give
two examples of other factors about the people surveyed that you think
should also have been taken into account.


*3. Explain why the number of people in a sample is an important factor to consider
when designing a study.


4. Explain what problems arise in trying to make conclusions based on a survey
mailed to the subscribers of a specialty magazine. Find or construct an example.


5. “If you have borderline high blood pressure, taking magnesium supplements
may help, Japanese researchers report. Blood pressure fell significantly in subjects
who got 400-500 milligrams of magnesium a day for four weeks, but not
in those getting a placebo” (USA Weekend, 22-24 May 1998, p. 11).

a. Do you think this was a randomized experiment or an observational study?
Explain.

b. Do you think the relationship found in this study is a causal one, in which
taking magnesium actually causes blood pressure to be lowered? Explain.


6. Refer to Case Study 1.1. When Salk measured the results, he divided the babies
into three groups based on whether they had low (2510 to 3000 g), medium
(3010 to 3500 g), or high (3510 g and over) birthweights. He then compared the
infants from the heartbeat and silent nurseries separately within each birthweight
group. Why do you think he did that? (Hint: Remember that it would be
easier to detect a difference in male and female pulse rates if all males measured
65 beats per minute and all females measured 75 than it would be if both groups
were quite diverse.)


7. A psychology department is interested in comparing two methods for teaching
introductory psychology. Four hundred students plan to enroll for the course at
10:00 a.m. and another 200 plan to enroll for the course at 4:00 p.m. The registrar
will allow the department to teach multiple sections at each time slot, and
to assign students to any one of the sections taught at the student’s desired time.
Design a study to compare the two teaching methods. For example, would it be
a good idea to use one method on all of the 10:00 sections and the other method
on all of the 4:00 sections? Explain your reasoning.


*8. Suppose you have a choice of two grocery stores in your neighborhood. Because
you hate waiting, you want to choose the one for which there is generally
a shorter wait in the checkout line. How would you gather information to
determine which one is faster? Would it be sufficient to visit each store once
and time how long you had to wait in line? Explain.


*9. Suppose researchers want to know whether smoking cigars increases the risk of
esophageal cancer.

*a. Could they conduct a randomized experiment to test this? Explain.

*b. If they conducted an observational study and found that cigar smokers had a
higher rate of esophageal cancer than those who did not smoke cigars, could
they conclude that smoking cigars increases the risk of esophageal cancer?
Explain why or why not.


10. Universities are sometimes ranked for prestige according to the amount of
research funding their faculty members are able to obtain from outside sources.
Explain why it would not be fair to simply use total dollar amounts for each
university, and describe what should be used instead.


11. Refer to Case Study 1.3, in which two students were accused of cheating because
the licensing board determined the odds of such similar answers were
less than 1 in 10,000. Further investigation revealed that over 20% of all pairs
of students had matches giving low odds like these (Klein, 1992, p. 26).
Clearly, something was wrong with the method used by the board. Read the
case study and explain what erroneous assumption they made in their determination
of the odds. (Hint: Use your own experience with answering multiple-choice
questions.)


12. Suppose the officials in the city or town where you live would like to ask questions
of a “representative sample” of the adult population. Explain some of the
characteristics this sample should have. For example, would it be sufficient to
include only homeowners?


*13. Suppose you have 20 tomato plants and want to know if fertilizing them will
help them produce more fruit. You randomly assign 10 of them to receive fertilizer
and the remaining 10 to receive none. You otherwise treat the plants in an
identical manner.

*a. Explain whether this would be an observational study or a randomized
experiment.

*b. If the fertilized plants produce 30% more fruit than the unfertilized plants,
can you conclude that the fertilizer caused the plants to produce more?
Explain.


14. Give an example of a decision in your own life, such as which route to take to
school, for which you think statistics would be useful in making the decision.
Explain how you could collect and process information to help make the
decision.


*15. National polls are often conducted by asking the opinions of a few thousand
adults nationwide and using them to infer the opinions of all adults in the
nation. Explain who is in the sample and who is in the population for such
polls.


16. Sometimes television news programs ask viewers to call and register their opinions
about an issue. One number is to be called for a “yes” opinion and another
number for a “no” vote. Do you think viewers who call are a representative sample
of all viewers? Explain.


*17. Suppose a study first asked people whether they meditate regularly and then
measured their blood pressures. The idea would be to see if those who meditate
have lower blood pressure than those who do not do so.

*a. Explain whether this would be an observational study or a randomized
experiment.

*b. If it were found that meditators had lower-than-average blood pressures, can
we conclude that meditation causes lower blood pressure? Explain.


18. Suppose a researcher would like to determine whether one grade of gasoline
produces better gas mileage than another grade. Twenty cars are randomly divided
into two groups, with 10 cars receiving one grade and 10 receiving the
other. After many trips, average mileage is computed for each car.

a. Would it be easier to detect a difference in gas mileage for the two grades if
the 20 cars were all the same size, or would it be easier if they covered a wide
range of sizes and weights? Explain.

b. What would be one disadvantage to using cars that were all the same size?


19. Suppose the administration at your school wants to know how students feel
about a policy banning smoking on campus. Because they can’t ask all students,
they must rely on a sample.

a. Give an example of a sample they could choose that would not be representative
of all students.

b. Explain how you think they could get a representative sample.


20. A newspaper headline read, “Study finds walking a key to good health: Six brisk
outings a month cut death risk.” Comment on what type of study you think was
done and whether this is a good headline.


Mini-Projects


1. Design and carry out a study to test the proposition that men have lower resting
pulse rates than women.


2. Find a newspaper or Web article that discusses a recent study involving statistical
methods. Identify the study as either an observational study or a randomized
experiment. Comment on how well the simple concepts discussed in this chapter
have been applied in the study. Comment on whether the news article, including
the headline, accurately reports the conclusions that can legitimately be
made from the study. Finally, discuss whether any information is missing from
the news article that would have helped you answer the previous questions.


CHAPTER 2


Reading the News


Thought Questions 


1. Advice columnists sometimes ask readers to write and express their feelings about
certain topics. For instance, Ann Landers once asked readers whether they thought
engineers made good husbands. Do you think the responses are representative of
public opinion? Explain why or why not.


2. Taste tests of new products are often done by having people taste both the new
product and an old familiar standard. Do you think the results would be biased if the
person handing the products to the respondents knew which was which? Explain
why or why not.


3. Nicotine patches attached to the arm of someone who is trying to quit smoking
dispense nicotine into the blood. Suppose you read about a study showing that nicotine
patches were twice as effective in getting people to quit smoking as “control”
patches (made to look like the real thing). Further, suppose you are a smoker trying
to quit. What questions would you want answered about how the study was done
and its results before you decided whether to try the patches yourself?


4. For a door-to-door survey on opinions about various political issues, do you think it
matters who conducts the interviews? Give an example of how it might make a
difference.


PART 1 


Finding Data in Life 


2.1 The Educated Consumer of Data 


Pick up any newspaper or newsmagazine and you are almost certain to find a story 
containing conclusions based on data. Should you believe what you read? Not al- 
ways. It depends on how the data were collected, measured, and summarized. In this 
chapter, we discuss seven critical components of statistical studies. We examine the 
kinds of questions you should ask before you believe what you read. We go into fur- 
ther detail about these issues in subsequent chapters. The goal in this chapter is to 
give you an overview of how to be a more educated consumer of the data you en- 
counter in your everyday life. 


What Are Data? 


In statistical parlance, data is a plural word referring to a collection of numbers or 
other pieces of information to which meaning has been attached. For example, the 
numbers 1, 3, and 10 are not necessarily data, but they become so when we are told 
that these were the weight gains in grams of three of the infants in Salk’s heartbeat 
study, discussed in Chapter 1. In Case Study 1.2, the data consisted of two pieces of 
information measured for each participant: (1) whether they took aspirin or a 
placebo, and (2) whether they had a heart attack. 


Don’t Always Believe What You Read 


When you read the results of a study in the newspaper, you are rarely presented with 
the actual data. Someone has usually summarized the information for you, and he or 
she has probably already drawn conclusions and presented them to you. Don’t al- 
ways believe them. The meaning we can attach to data, and to the resulting conclu- 
sions, depends on how well the information was acquired and summarized. 

In the remaining chapters of Part 1, we look at proper ways to obtain data. In Part 
2, we turn our attention to how it should be summarized. In Part 4, we learn the 
power as well as the limitations of using the data collected from a sample to make 
conclusions about the larger population. In this chapter, we address seven features of 
statistical studies that you should think about when you read a news article. You will 
begin to be able to think critically and make your own conclusions about what you 
read. 


2.2 Origins of News Stories 


Where do news stories originate? How do reporters hear about events and determine 
that they are newsworthy? For stories based on statistical studies there are several 
possible sources. The two most common of these sources are also the most common 
outlets for researchers to present the results of their work: academic conferences and
scholarly journals. 

Every academic discipline holds conferences, usually annually, in which re- 
searchers can share their results with others. Reporters routinely attend these aca- 
demic conferences and look for interesting news stories. For larger conferences, 
there is usually a “press room” where researchers can leave press releases for the me- 
dia. If you pay attention, you will notice that in certain weeks of the year there will 
be several news stories about studies with related themes. For instance, the Ameri- 
can Psychological Association meets in August, and there are generally some news 
stories emerging from results presented there. The American Association for the Ad- 
vancement of Science meets in February, and news stories related to various areas of 
science will appear in the news that week. 

One problem with news stories based on conference presentations is that there is 
unlikely to be a corresponding written report by the researchers, so it is difficult for 
readers of the news story to obtain further information. News stories based on con- 
ference reports generally mention the name and date of the conference as well as the 
name and institution of the lead researcher, so sometimes it is possible to contact the 
researcher for further information. Some researchers make conference presentations 
available on their Web sites. 

In contrast, many news stories about statistical studies are based on published ar- 
ticles in scholarly journals. Reporters routinely read these journals when they are 
published, or they get advance press releases from the journal offices. News stories 
based on journal articles usually mention the journal and date of publication, so if 
you are interested in learning more about the study, you can obtain the original jour- 
nal article. Journal articles are sometimes available on the journal’s Web site or on 
the Web site of the author(s). You can also write to the lead author and request that 
a “reprint” be sent to you. 

As a third source of news stories about statistical studies, some government and 
private agencies release in-depth research reports. Unlike journal articles, these re- 
ports are not necessarily “peer-reviewed” or checked by neutral experts on the topic. 
An advantage of these reports is that they are not restricted by space limitations im- 
posed by journals and often provide much more in-depth information than do jour- 
nal articles. 

A supplementary source from which news stories may originate is a university 
media office. Most research universities have an office that provides press releases 
when faculty members have completed research that may be of interest to the pub- 
lic. The timing of these news releases usually corresponds to a presentation at an 
academic conference or publication of results in an academic journal, but the news 
release summarizes the information so that journalists don’t have to be as versed in 
the technical aspects of the research to write a good story. When you read about a 
study in the news and would like more information, the news office of the lead re- 
searcher’s institution is a good place to start looking. They may have issued a press 
release on which the story was based. 

News Stories and Original Sources in the Appendix and on the CD


To illustrate how the concepts in this book are used in research and eventually 
converted into news stories, there is a collection of examples included with 
this book. In each case, the example includes a story from a newspaper, mag- 
azine, or Web site, and these are printed in the Appendix and on the CD ac- 
companying the book. Sometimes there is also a press release. These are 
provided as an additional “News Story” and included on the CD. Most of the 
news stories are based on articles from scholarly journals or detailed reports. 
Many of these articles are printed in full on the CD, labeled as the “Original 
Source.” Throughout this book you will find a CD icon when you
need to refer to the material on the CD. By comparing the news
story and the original source, you will learn how to evaluate what
is reported in the news.


2.3 How to Be a Statistics Sleuth: Seven Critical Components


Reading and interpreting the results of surveys or experiments is not much different 
from reading and interpreting the results of other events of interest, such as sports 
competitions or criminal investigations. If you are a sports fan, then you know what 
information should be included in reports of competitions and you know when cru- 
cial information is missing. If you have ever been involved in an event that was later 
reported in the newspaper, you know that missing information can lead readers to er- 
roneous conclusions. 

In this section, you are going to learn what information should be included in 
news reports of statistical studies. Unfortunately, crucial information is often miss- 
ing. With some practice, you can learn to figure out what’s missing, as well as how 
to interpret what’s reported. You will no longer be at the mercy of someone else’s 
conclusions. You will be able to determine them for yourself. To provide structure to 
our examination of news reports, let’s list Seven Critical Components that determine 
the soundness of statistical studies. A good news report should provide you with in- 
formation about all of the components that are relevant to that study. 


Component 1: The source of the research and of the funding.

Component 2: The researchers who had contact with the participants.

Component 3: The individuals or objects studied and how they were selected.

Component 4: The exact nature of the measurements made or questions asked.

Component 5: The setting in which the measurements were taken.

Component 6: Differences in the groups being compared, in addition to the
factor of interest.

Component 7: The extent or size of any claimed effects or differences.


Before delving into some examples, let’s examine each component more closely. 
You will find that most of the problems with studies are easy to identify. Listing 
these components simply provides a framework for using your common sense. 
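
For readers who like to keep notes while reading a news story, the seven components can be treated as a simple checklist. The following Python sketch is one hypothetical way to do so; the function name and the sample notes are invented for illustration and are not part of the framework itself.

```python
# The Seven Critical Components, keyed by number.
COMPONENTS = {
    1: "The source of the research and of the funding",
    2: "The researchers who had contact with the participants",
    3: "The individuals or objects studied and how they were selected",
    4: "The exact nature of the measurements made or questions asked",
    5: "The setting in which the measurements were taken",
    6: "Differences in the groups being compared, in addition to the factor of interest",
    7: "The extent or size of any claimed effects or differences",
}


def missing_components(notes):
    """Return the component numbers for which no notes were recorded.

    `notes` maps a component number to whatever the reader wrote down
    from the news report; a blank or absent entry counts as missing.
    """
    return [k for k in COMPONENTS if not notes.get(k, "").strip()]


# Hypothetical notes taken while reading one news story.
notes = {
    1: "University study; funding source not stated in the article",
    3: "About 1,000 adult volunteers recruited through a newspaper ad",
    7: "Reported a 4-point difference between the groups",
}

for k in missing_components(notes):
    print(f"Component {k} not reported: {COMPONENTS[k]}")
```

A list of unreported components does not show that a study is flawed. It shows which questions remain to be asked before accepting the report's conclusions.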


Component 1: The source of the research and of the funding. Studies are conducted
for three major reasons. First, governments and private companies need to
have data in order to make wise policy decisions. Information such as unemployment
rates and consumer spending patterns is measured for this reason. Second,
researchers at universities and other institutions are paid to ask and answer interesting
questions about the world around us. The curious questioning and experimentation 
of such researchers have resulted in many social, medical, and scientific advances. 
Much of this research is funded by government agencies, such as the National Insti- 
tutes of Health. Third, companies want to convince consumers that their programs 
and products work better than the competition’s, or special-interest groups want to 
prove that their point of view is held by the majority. 

Unfortunately, it is not always easy to discover who funded research. Many uni- 
versity researchers are now funded by private companies. In her book Tainted Truth 
(1994), Cynthia Crossen warns us: 


Private companies, meanwhile, have found it both cheaper and more presti- 
gious to retain academic, government, or commercial researchers than to set 
up in-house operations that some might suspect of fraud. Corporations, liti- 
gants, political candidates, trade associations, lobbyists, special interest 
groups—all can buy research to use as they like. (p. 19) 


If you discover that a study was funded by an organization that would be likely to 
have a strong preference for a particular outcome, it is especially important to be 
sure that correct scientific procedures were followed. In other words, be sure the re- 
maining components have sound explanations. 


Component 2: The researchers who had contact with the participants. It is
important to know who actually had contact with the participants and what message
those people conveyed. Participants often give answers or behave in ways to comply 
with the desires of the researchers. Consider, for example, a study done at a shop- 
ping mall to compare a new brand of a certain product to an old familiar brand. 
Shoppers are asked to taste each brand and state their preference. It is crucial that 
both the person presenting the two brands and the respondents be kept entirely blind
as to which is which until after the preferences have been selected. Any clues might 
bias the respondent to choose the old familiar brand. Or, if the interviewer is clearly 
eager to have them choose one brand over the other, the respondents will most likely 
oblige in order to please. As another example, if you discovered that a study on the 
prevalence of illegal drug use was conducted by sending uniformed police officers 
door-to-door, you would probably not have much faith in the results. We will discuss 
other ways in which researchers influence participants in Chapters 4 and 5. 


Component 3: The individuals or objects studied and how they were selected. It
is important to know to whom the results can be extended. In general, the results of
a study apply only to individuals similar to those in the study. For example, until re- 
cently, many medical studies included men only, so the results were of little value to 
women. When determining who is similar to those in the study, it is also important 
to know how participants were enlisted for the study. Many studies rely on volun- 
teers recruited through the newspaper, who are usually paid a small amount for their 
participation. People who would respond to such recruitment efforts may differ in 
relevant ways from those who would not. Surveys relying on voluntary responses are 
likely to be biased because only those who feel strongly about the issues are likely 
to respond. For instance, some Web sites have a “question of the day” to which peo- 
ple are asked to voluntarily respond by clicking on their preferred answer. Only 
those who have strong opinions are likely to participate, so the results cannot be ex- 
tended to any larger group. 


Component 4: The exact nature of the measurements made or questions asked.
As you will see in Chapter 3, precisely defining and measuring most of the things
researchers study isn’t easy. For example, if you wanted to measure whether people
“eat breakfast,” how would you do so? What if they just have juice? What if they
work until midmorning and then eat a meal that satisfies them until dinner? You need 
to understand exactly what the various definitions mean when you read about some- 
one else’s measurements. 

In polls and surveys, the “measurements” are usually answers to specific ques- 
tions. Both the wording and the ordering of the questions can influence answers. For 
example, a question about “street people” would probably elicit different responses 
than a question about “families who have no home.” Ideally, you should be given the 
exact wording that was used in a survey or poll. 


Component 5: The setting in which the measurements were taken The setting 
in which measurements were taken includes factors such as when and where they 
were taken and whether respondents were contacted by phone, mail, or in person. A 
study can be easily biased by timing. For example, opinions on whether criminals 
should be locked away for life may change drastically following a highly publicized 
murder or kidnapping case. If a study is conducted by telephone and calls are made 
only in the evening, certain groups of people would be excluded, such as those who 
work the evening shift or who routinely eat dinner in restaurants. 

Where the measurements were taken can also influence the results. Questions 
about sensitive topics, such as sexual behavior or income, might be more readily an- 
swered over the phone, where respondents feel more anonymous. Sometimes re- 
search is done in a laboratory or university office, and the results may not readily 
extend to a natural setting. For example, studies of communication between two peo- 
ple are sometimes done by asking them to conduct a conversation in a university of- 
fice with a tape recorder present. Such conditions almost certainly produce more 
limited conversation than would occur in a more natural setting. 


Component 6: Differences in the groups being compared, in addition to the fac- 
tor of interest If two or more groups are being compared on a factor of interest, it 
is important to consider other ways in which the groups may differ that might influ- 
ence the comparison. For example, suppose researchers want to know if smoking 
marijuana is related to academic performance. If the group of people who smoke 
marijuana has lower test scores than the group of people who don’t, researchers may 
want to conclude that the lower test scores are due to smoking marijuana. Often, 
however, other disparities in the groups can explain the observed difference just as 
well. For example, people who smoke marijuana may simply be the type of people 
who are less motivated to study and thus would score lower on tests whether they 
smoked or not. Reports of research should include an explanation of any such pos- 
sible differences that might account for the results. We will explore the issue of these 
kinds of extraneous factors, and how to control for them, in much more detail in 
Chapter 5. 


Component 7: The extent or size of any claimed effects or differences Media 
reports about statistical studies often fail to tell you how large the observed effects 
were. Without that knowledge, it is hard for you to assess whether you think the re- 
sults are of any practical importance. For example, if, based on Case Study 1.2, you 
were told simply that taking aspirin every other day reduced the risk of heart attacks, 
you would not be able to determine whether it would be worthwhile to take aspirin. 
You should instead be told that for the men in the study, the rate was reduced from 
about 17 heart attacks per 1000 participants without aspirin to about 9.4 heart attacks 
per 1000 with aspirin. Often news reports simply report that a treatment had an ef- 
fect or that a difference was observed, but don’t tell you the size of the difference or 
effect. We will investigate this issue in great detail in Part 4 of this book. 
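
To see why the size of an effect matters, here is a minimal sketch in Python that turns the two rates quoted above into an absolute and a relative reduction. It uses only the rounded figures from the text (17 and 9.4 heart attacks per 1000 men); the function name is our own illustrative choice and is not part of the study.

```python
def risk_reduction(control_per_1000, treated_per_1000):
    """Return (absolute reduction per 1000, relative reduction as a fraction)."""
    absolute = control_per_1000 - treated_per_1000
    relative = absolute / control_per_1000
    return absolute, relative


# Rounded rates quoted in the text for the men in the aspirin study.
absolute, relative = risk_reduction(17.0, 9.4)

print(f"Absolute reduction: about {absolute:.1f} heart attacks per 1000 men")
print(f"Relative reduction: about {relative:.0%}")
```

With these rounded inputs, the absolute reduction is about 7.6 heart attacks per 1000 men and the relative reduction is roughly 45%. Both numbers describe the same result, but they can leave very different impressions.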


2.4 Four Hypothetical Examples of Bad Reports 


Throughout this book, you will see numerous examples of real studies and news re- 
ports. So that you can get some practice finding problems without having to read un- 
necessarily long news articles, let’s examine some hypothetical reports. These are 
admittedly more problematic than many real reports because they serve to illustrate 
several difficulties at once. 
Hypothetical News Article 1 


Study Shows Psychology Majors Are Smarter than Chemistry Majors 


A fourth-year psychology student, for her senior thesis, conducted a study to see if 
students in her major were smarter than those majoring in chemistry. She handed out 
questionnaires in five advanced psychology classes and five advanced chemistry labs. 
She asked the students who were in class to record their grade-point averages (GPAs) 
and their majors. Using the data only from those who were actually majors in these 
fields in each set of classes, she found that the psychology majors had an average GPA 
of 3.05, whereas the chemistry majors had an average GPA of only 2.91. The study was 
conducted last Wednesday, the day before students were home enjoying Thanksgiving 
dinner. 


Read each article and see if your common sense gives you some reasons why the 
headline is misleading. Then proceed to read the commentary about the Seven Crit- 
ical Components. 


Hypothetical News Article 1: “Study Shows Psychology Majors Are Smarter than Chemistry Majors” 

Component 1: The source of the research and of the funding The study was a 
senior thesis project conducted by a psychology major. Presumably, it was cheap to 
run and was paid for by the student. One could argue that she would have a reason 
to want the results to come out as they did, although in a properly conducted study 
the influence of the experimenter’s motives should be minimal. As we shall see, there were 
additional problems with this study. 


Component 2: The researchers who had contact with the participants Presum- 
ably, only the student conducting the study had contact with the respondents. Cru- 
cial missing information is whether she told them the purpose of the study. Even if 
she did not tell them, many of the psychology majors may have known her and 
known what she was doing. Any clues as to desired outcomes on the part of experi- 
menters can bias the results. 


Component 3: The individuals or objects studied and how they were selected 
The individuals selected are the crux of the problem here. The measurements were 
taken on advanced psychology and chemistry students, which would have been fine 
if they had been sampled correctly. However, only those who were in the psychol- 
ogy classes or in the chemistry labs that day were actually measured. Less consci- 
entious students are more likely to leave early before a holiday, but a missed class is 
probably easier to make up than a missed lab. Therefore, perhaps a larger proportion 
of the students with low grade-point averages were absent from the psychology 
classes than from the chemistry labs. Due to the missing students, the investigator’s 
results would overestimate the average GPA for psychology students more so than 
for chemistry students. 
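
The following sketch, in Python, illustrates how this kind of selective absence could create a difference where none exists. The GPAs and the absence rules are invented for illustration (the article gives no such details): both majors are given exactly the same GPAs, but the two lowest-GPA psychology students skip class while only the lowest-GPA chemistry student skips lab.

```python
from statistics import mean

# HYPOTHETICAL GPAs: both majors share the same true distribution.
true_gpas = [2.2, 2.5, 2.8, 3.0, 3.2, 3.5, 3.8]


def observed_mean(gpas, absent_below):
    """Average GPA among students present, assuming (as an exaggeration)
    that everyone with a GPA below `absent_below` was absent that day."""
    present = [g for g in gpas if g >= absent_below]
    return mean(present)


true_mean = mean(true_gpas)
psych_observed = observed_mean(true_gpas, absent_below=2.8)  # two lowest skip class
chem_observed = observed_mean(true_gpas, absent_below=2.4)   # only the lowest skips lab

print(f"True average for either major: {true_mean:.2f}")
print(f"Observed psychology average:   {psych_observed:.2f}")
print(f"Observed chemistry average:    {chem_observed:.2f}")
```

Although the two groups are identical by construction, the students who were present give a higher average for psychology than for chemistry, and both averages are higher than the true one.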


Component 4: The exact nature of the measurements made or questions asked 
Students were asked to give a “self-report” of their grade-point averages. A more 
accurate method would have been to obtain this information from the registrar at 
the university. Students may not know their exact grade-point average. Also, one 
group may be more likely to know the exact value than the other. For example, 
if many of the chemistry majors were planning to apply to medical school in the 
near future, they may be only too aware of their grades. Further, the headline im- 
plies that GPA is a measure of intelligence. Finally, the research assumes that GPA 
is a standard measure. Perhaps grading is more competitive in the chemistry 
department. 


Component 5: The setting in which the measurements were taken Notice that 
the article specifies that the measurements were taken on the day before a major hol- 
iday. Unless the university consisted mainly of commuters, many students may have 
left early for the holiday, further aggravating the problem that the students with 
lower grades were more likely to be missing from the psychology classes than from 
the chemistry labs. Further, because students turned in their questionnaires anony- 
mously, there was presumably no accountability for incorrect answers. 


Component 6: Differences in the groups being compared, in addition to the fac- 
tor of interest The factor of interest is the student’s major, and the two groups be- 
ing compared are psychology majors and chemistry majors. This component 
considers whether the students who were surveyed for the study may differ in 
ways other than their choice of major. It is difficult to know what differences might 
exist without knowing more about the particular university. For example, because 
psychology is such a popular major, at some universities students are required to 
have a certain GPA before they are admitted to the major. A university with a sepa- 
rate premedical major might have the best of the science students enrolled in that 
major instead of chemistry. Those kinds of extraneous factors would be relevant to 
interpreting the results of the study. 


Component 7: The extent or size of any claimed effects or differences The news 
report does present this information, by noting that the average GPAs for the two 
groups were 3.05 and 2.91. Additional useful information would be to know how 
many students were included in each of the averages given, what percentage of all 
students in each major were represented in the sample, and how much variation there 
was among GPAs within each of the two groups. 


Hypothetical News Article 2 


Per Capita Income of U.S. Shrinks Relative to Other Countries 


An independent research group, the Institute for Foreign Investment, has noted that the 
per capita income of Americans has been shrinking relative to some other countries. 
Using per capita income figures from the World Almanac and exchange rates from last 
Friday’s financial pages, the organization warned that per capita income for the United 
States has risen only 10% during the past 5 years, whereas per capita income for certain 
other countries has risen 50%. The researchers concluded that more foreign investment 
should be allowed in the United States to bolster the sagging economy. 


Hypothetical News Article 2: “Per Capita Income of U.S. Shrinks Relative to Other Countries” 

Component 1: The source of the research and of the funding We are told nothing 
except the name of the group that conducted the study, which should be fair warning. 
Being called “an independent research group” in the story does not mean that it is an 
unbiased research group. In fact, the last line of the story illustrates the probable 
motive for their research. 


Component 2: The researchers who had contact with the participants This 
component is not relevant because there were no participants in the study. 


Component 3: The individuals or objects studied and how they were selected 
The objects in this study were the countries used for comparison with the United 
States. We should have been told which countries were used, and why. 


Component 4: The exact nature of the measurements made or questions asked 
This is the major problem with this study. First, as mentioned, we are not even told 
which countries were used for comparison. Second, current exchange rates but 
older per capita income figures were used. If the rate of inflation in a country had 
recently been very high, so that a large rise in per capita income did not reflect a 
concomitant rise in spending power, then we should not be surprised to see a large 
increase in per capita income in terms of actual dollars. In order to make a valid 
comparison, all figures would have to be adjusted to comparable measures of 
spending power, taking inflation into account. We will learn how to do that in 
Chapter 14. 
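
As a preview, one common way to put income growth in terms of spending power is to divide the growth in income by the growth in prices over the same period. The sketch below applies that idea to the 10% and 50% increases quoted in the article. The inflation figures are purely hypothetical, since the article reports none, and this is not necessarily the method presented in Chapter 14.

```python
def real_growth(nominal_growth, inflation):
    """Growth in spending power, given nominal income growth and cumulative
    price inflation over the same period (both as fractions, e.g. 0.10 = 10%)."""
    return (1 + nominal_growth) / (1 + inflation) - 1


# Nominal growth figures come from the article; inflation figures are HYPOTHETICAL.
us_real = real_growth(nominal_growth=0.10, inflation=0.05)
other_real = real_growth(nominal_growth=0.50, inflation=0.45)

print(f"United States: {us_real:.1%} growth in spending power")
print(f"Other country: {other_real:.1%} growth in spending power")
```

Under these made-up inflation rates, the country whose income rose 50% actually gained less spending power (about 3.4%) than the United States (about 4.8%).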


Components 5, 6, and 7: The setting in which the measurements were taken. 
Differences in the groups being compared, in addition to the factor of interest. 
The extent or size of any claimed effects or differences These issues are not rel- 
evant here, except as they have already been discussed. For example, although the 
size of the difference between the United States and the other countries is reported, 
it is meaningless without an inflation adjustment. 


Hypothetical News Article 3 


Researchers Find Drug to Cure Excessive Barking in Dogs 


Barking dogs can be a real problem, as anyone who has been kept awake at night by the 
barking of a neighbor’s canine companion will know. Researchers at a local university 
have tested a new drug that they hope will put all concerned to rest. Twenty dog owners 
responded to a newspaper article asking for volunteers with problem barking dogs to 
participate in a study. The dogs were randomly assigned to two groups. One group of dogs 
was given the drug, administered as a shot, and the other dogs were not. Both groups 
were kept overnight at the research facility and frequency of barking was observed. The 
researchers deliberately tried to provoke the dogs into barking by doing things like 
ringing the doorbell of the facility and having a mail carrier walk up to the door. The 
two groups were treated on separate weekends because the facility was only large enough 
to hold ten dogs. The researchers left a tape recorder running and measured the amount 
of time during which any barking was heard. The dogs who had been given the drug spent 
only half as much time barking as did the dogs in the control group. 


Hypothetical News Article 3: “Researchers Find Drug to Cure Excessive Barking in Dogs” 

Component 1: The source of the research and of the funding We are not told 
why this study was conducted. Presumably it was because the researchers were in- 
terested in helping to solve a societal problem, but perhaps not. It is not uncommon 
for drug companies to fund research to test a new product or a new use for a current 
product. If that were the case, the researchers would have added incentive for the re- 
sults to come out favorable to the drug. If everything were done correctly, such an 
incentive wouldn’t be a major factor; however, when research is funded by a private 
source, that information should be announced when the results are announced. 


Component 2: The researchers who had contact with the participants We are 
not given any information about who actually had contact with the dogs. One im- 
portant question is whether the same handlers were used with both groups of dogs. 
If not, the difference in handlers could explain the results. Further, we are not told 
whether the dogs were primarily left alone or were attended most of the time. If re- 
searchers were present most of the time, their behavior toward the dogs could have 
had a major impact on the amount of barking. 


Component 3: The individuals or objects studied and how they were selected 
We are told that the study used dogs whose owners volunteered them as problem 
dogs for the study. Although the report does not mention payment, it is quite 
common for volunteers to receive monetary compensation for their participation. 
The volunteers presumably lived in the area of the university. The dog owners had 
to be willing to be separated from their pets for the weekend. These and other fac- 
tors mean that the owners and dogs who participated may differ from the general 
population. Further, the initial reasons for the problem behavior may vary from one 
participant to the next, yet the dogs were measured together. Therefore, there is no 
way to ascertain if, for example, dogs who bark only because they are lonely would 
be helped. In any case, we cannot extend the results of this study to conclude that 
the drug would work similarly on all dogs or even on all problem dogs. However, because 
the dogs were randomly assigned to the two groups, we would be able to extend the results 
to all dogs similar to those who participated, provided there were no other problems. 


Component 4: The exact nature of the measurements made or questions asked 
The researchers measured each group of dogs as a group, by listening to a tape and 
recording the amount of time during which there was any barking. Because dogs are 
quite responsive to group behavior, one barking dog could set the whole group bark- 
ing for a long time. Therefore, just one particularly obnoxious dog in the control 
group alone could explain the results. It would have been better to separate the dogs 
and measure each one individually. 


Component 5: The setting in which the measurements were taken The groups 
were measured on separate weekends. This creates another problem. First, the re- 
searchers knew which group was which and may have unconsciously provoked the 
control group slightly more than the group receiving the drug. Further, conditions 
differed over the two weekends. Perhaps it was sunny one weekend and raining the next, 
or there were other subtle differences, such as more traffic one weekend than the 
next, small planes overhead, and so on. All of these could change the behavior of the 
dogs but might go unnoticed or unreported by the experimenters. 

The measurements were also taken outside of the dogs’ natural environments. 
The dogs in the experimental group in particular would have reason to be upset be- 
cause they were first given a shot and then put together with nine other dogs in the 
research facility. It would have been better to put them back into their natural envi- 
ronment because that’s where the problem barking was known to occur. 


Component 6: Differences in the groups being compared, in addition to the fac- 
tor of interest The dogs were randomly assigned to the two groups (drug or no 
drug), which should have minimized overall differences in size, temperament, and 
so on for the dogs in the two groups. However, differences were induced between the 
two groups by the way the experiment was conducted. Recall that the groups were 
measured on different weekends—this could have created the difference in behavior. 
Also, the drug-treated dogs were given a shot to administer the drug, whereas the 
control group was given no shot. It could be that the very act of getting a shot made 
the drug group lethargic. A better design would have been to administer a placebo 
shot—that is, a shot with an inert substance—to the control group. 


Component 7: The extent or size of any claimed effects or differences We are 
told only that the treated group barked half as much as the control group. We are not 
told how much time either group spent barking. If one group barked 8 hours a day 
but the other group only 4 hours a day, that would not be a satisfactory solution to 
the problem of barking dogs. 


Hypothetical News Article 4 


Survey Finds Most Women Unhappy in Their Choice of Husbands 


A popular women’s magazine, in a survey of its subscribers, found that over 90% of them 
are unhappy in their choice of whom they married. Copies of the survey were mailed to 
the magazine’s 100,000 subscribers. Surveys were returned by 5000 readers. Of those 
responding, 4520, or slightly over 90%, answered no to the question: “If you had it to 
do over again, would you marry the same man?” To keep the survey simple so that people 
would return it, only two other questions were asked. The second question was, “Do you 
think being married is better than being single?” Despite their unhappiness with their 
choice of spouse, 70% answered yes to this. The final question, “Do you think you will 
outlive your husband?” received a yes answer from 80% of the respondents. Because women 
generally live longer than men, and tend to marry men somewhat older than themselves, 
this response was not surprising. The magazine editors were at a loss to explain the 
huge proportion of women who would choose differently. The editor could only speculate: 
“I guess finding Mr. Right is much harder than anyone realized.” 


Hypothetical News Article 4: “Survey Finds Most Women Unhappy in Their Choice of Husbands” 

Components 1 through 7 We don’t even need to consider the details of this 
study because it contains a fatal flaw from the outset. The survey is an example of 
what is called a “volunteer sample” or a “self-selected sample.” Of the 100,000 
who received the survey, only 5% responded. The people who are most likely to re- 
spond to such a survey are those who have a strong emotional response to the ques- 
tion. In this case, it would be women who are unhappy with their current situation 
who would probably respond. Notice that the other two questions are more general 
and thus not likely to arouse much emotion either way. Thus, it is the strong reac- 
tion to the first question that would drive people to respond. The results would cer- 
tainly not be representative of “most women” or even of most subscribers to the 
magazine. 
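
A short calculation shows how little the 5000 replies can tell us about all 100,000 subscribers. The sketch below, in Python, uses only the counts given in the article and considers the two extreme possibilities for the subscribers who did not respond; the variable names are our own.

```python
subscribers = 100_000
respondents = 5_000
answered_no = 4_520

response_rate = respondents / subscribers          # share who returned the survey
no_among_respondents = answered_no / respondents   # the "slightly over 90%" figure

nonrespondents = subscribers - respondents

# Extreme cases for the subscribers who never replied:
lowest_possible = answered_no / subscribers                      # none would say no
highest_possible = (answered_no + nonrespondents) / subscribers  # all would say no

print(f"Response rate: {response_rate:.0%}")
print(f"'No' among respondents: {no_among_respondents:.1%}")
print(f"'No' among all subscribers could be anywhere from "
      f"{lowest_possible:.1%} to {highest_possible:.1%}")
```

The 90.4% figure applies only to the respondents. Depending on what the 95,000 nonrespondents would have said, the percentage of all subscribers who would answer no could be anywhere from about 4.5% to about 99.5%.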

CASE STUDY 2.1 


Who Suffers from Hangovers? 
Source: News Story 2 in the Appendix and Original Source 2 on the CD. 


Read News Story 2 in the Appendix, “Research shows women harder hit by hang- 
overs” and access the original source of the story on the CD, the journal article “De- 
velopment and Initial Validation of the Hangover Symptoms Scale: Prevalence and 
Correlates of Hangover Symptoms in College Students.” Let’s examine the seven 
critical components based on the news story and, where necessary, additional infor- 
mation provided in the journal article. 


Component 1: The source of the research and of the funding The news story 
covers this aspect well. The researchers were “a team at the University of Missouri- 
Columbia” and the study was “supported by the National Institutes of Health.” 


Component 2: The researchers who had contact with the participants This aspect 
of the study is not clear from the news article, which simply mentions, “The researchers 
asked 1,230 drinking college students . . .” However, the journal article 
says that the participants were enrolled in Introduction to Psychology courses and 
were asked to fill out a questionnaire. So it can be assumed that professors or re- 
search assistants in psychology had contact with the participants. 


Component 3: The individuals or objects studied and how they were selected 
The news story describes the participants as “1,230 drinking college students, only 
5 percent of whom were of legal drinking age.” The journal article provides much 
more information, including the important fact that the participants were all enrolled 
in introductory psychology classes and were participating in the research to fulfill a 
requirement for the course. The reader must decide whether this group of partici- 
pants is likely to be representative of all drinking college students, or some larger 
population, for severity of hangover symptoms. The journal article also provides in- 
formation on the sex, ethnicity, and age of participants. 


Component 4: The exact nature of the measurements made or questions asked 
The news story provides some detail about what was asked, noting that the partici- 
pants were asked “to describe how often they experienced any of 13 symptoms after 
drinking. The symptoms ranged from headaches and vomiting to feeling weak and 
unable to concentrate.” The journal article again provides much more detail, listing 
the 13 symptoms and explaining that participants were asked to indicate how often 
they were experienced on a 5-point scale (p. 1444 of the journal article). Further, par- 
ticipants were asked to provide a “hangover count” in which they noted how many 
times they had experienced at least one of the 13 symptoms in the past year, using a 
5-point scale. This scale ranged from “never” to “52 times or more.” Additional 
questions were asked about alcoholism in the participant’s family and early experi- 
ence with alcohol. Detailed information about all of these questions is included in 
the journal article. 

Component 5: The setting in which the measurements were taken This information 
is not provided explicitly, but it can be assumed that measurements were taken in the 
Psychology Department at the University of Missouri-Columbia. One missing fact that 
may be helpful in interpreting the results is whether the questions were administered 
to a large group of students at once or individually, and whether students could be 
identified when the researchers read their responses. 


Component 6: Differences in the groups being compared, in addition to the fac- 
tor of interest The purpose of the research was to develop and test a “Hangover 
Symptoms Scale” but two interesting differences in groups emerged when the re- 
searchers made comparisons. The groups being compared in the first instance were 
males and females; thus, Male/Female was the factor of interest. The researchers 
found that females suffered more from hangovers. This component is asking if there 
may be other differences between males and females, other than “Male” and “Fe- 
male” that could help account for the difference. One possibility mentioned in the 
news article is body weight. Males tend to weigh more than females on average. An 
interesting question, not answered by the research, is this: if a group of males and 
females of the same weight, say 130 pounds, were to consume the same amount of alcohol, 
would the females suffer more hangover symptoms? The difference in weight between 
the two groups is in addition to the factor of interest, which is Male/Female. 
It may be the weight difference, and not the sex difference, that accounts for the dif- 
ference in hangover severity. 

The other comparison mentioned in the news article is between students who had 
alcohol-related problems, or whose biological parents had such problems, and students 
who did not have that history. In this case, a history of alcohol-related problems (of the 
student or parents) is the factor of interest. However, you can probably think of other 
differences between the two groups (those with problems and those without) that may help 
account for the difference in hangover severity. For instance, students with a history of 
problems may not have eaten as healthfully, in the past or present, as students without 
such problems, and that may contribute to hangover severity. So the 
comparison of interest, between those with an alcohol problem in their background 
and those without, may be complicated by other differences in these two groups. 


Component 7: The extent or size of any claimed effects or differences The news 
story does not report how much difference in hangover severity was found between 
men and women, or between those with and without a history of alcohol problems. 
Reading the journal article may explain why this is so—the article itself does not re- 
port a simple difference. In fact, simple comparisons don’t yield much difference; 
for instance, 11% of men and 14% of women never experienced any hangover symp- 
toms in the previous year. Differences only emerged when complicating factors such 
as amount of alcohol consumed were factored in. The researchers report, “After con- 
trolling for the frequency of drinking and getting drunk and for the typical quantity 
of alcohol consumed when drinking, women were significantly more likely than men 
to experience at least one of the hangover symptoms” (p. 1446). The article does not 
elaborate, such as explaining what would be the difference for a male and female 
who drank the same amount and equally often. 

2.5 Planning Your Own Study: Defining the Components in Advance 


Although you may never have to design your own survey or experiment, it will help 
you understand how difficult it can be if we illustrate the Seven Critical Components 
for a very simple hypothetical study you might want to conduct. Suppose you are in- 
terested in determining which of three local supermarkets has the best prices so you 
can decide where to shop. Because you obviously can’t record and summarize the 
prices for all available items, you would have to use some sort of sample. 

To obtain meaningful data, you would need to make many decisions. Some of 
the components need to be reworded because they are being answered in advance of 
the study, and obviously not all of the components are relevant for this simple ex- 
ample. However, by going through them for such a simple case, you can see how 
many ambiguities and decisions can arise when designing a study. 


Component 1: The source of the research and of the funding Presumably you 
would be funding the study yourself, but before you start you need to decide why 
you are doing the study. Are you only interested in items you routinely buy, or are 
you interested in comparing the stores on the multitude of possible items? 


Component 2: The researchers who had contact with the participants In this 
example, the question would be who is going to visit the stores and record the prices. 
Will you personally visit each store and record the prices? Will you send friends to 
two of the stores and visit the third yourself? If you use other people, you would 
need to train them so there would be no ambiguities. 


Component 3: The individuals or objects studied and how they were selected 
In this case, the “objects studied” are items in the grocery store. The correct ques- 
tion is, “On what items should prices be recorded?” Do you want to use exactly the 
same items at all stores? What if one store offers its own brand but another only of- 
fers name brands? Do you want to choose a representative sampling of items you are 
likely to buy or choose from all possible items? Do you want to include nonfood 
items? How many items should you include? How should you choose which ones to 
select? If you are simply trying to minimize your own shopping bill, it is probably 
best to list the 20 or 30 items you buy most often. However, if you are interested in 
sharing your results with others, you might prefer to choose a representative sample 
of items from a long list of possibilities. 
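
If you do want a sample from a long list, one simple option is to let a random number generator choose the items, so that your own preferences do not influence the selection. The sketch below assumes a hypothetical list of 200 item names and a sample of 25; a simple random sample is only one way to aim for representativeness, and the names and sizes here are placeholders.

```python
import random

# HYPOTHETICAL master list; in practice this would be your long list of store items.
all_items = [f"item_{i:03d}" for i in range(1, 201)]

# A fixed seed makes the selection repeatable, so the same items
# can be priced at every store.
rng = random.Random(2024)
sample_items = rng.sample(all_items, k=25)  # 25 distinct items, chosen at random

for item in sorted(sample_items):
    print(item)
```

Pricing the same randomly chosen items at all three stores keeps the comparison from depending on which items happened to catch your eye.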


Component 4: The exact nature of the measurements made or questions asked 
You may think that the cost of an item in a supermarket is a well-defined measure- 
ment. But if a store is having a sale on a particular item on your list, should you use 
the sale price or the regular price? Should you use the price of the smallest possible 
size of the product? The largest? What if a store always has a sale on one brand or 
another of something, such as laundry soap, and you don’t really care which brand 
you buy? Should you then record the price of the brand on sale that week? Should 
you record the prices listed on the shelves or actually purchase the items and see if 
the prices listed were accurate? 


Component 5: The setting in which the measurements were taken When will 
you conduct the study? Supermarkets in university towns may offer sale prices on 
items typically bought by students at certain times of the year—for example, just af- 
ter students have returned from vacation. Many stores also offer sale items related to 
certain holidays, such as ham or turkey just before Christmas or eggs just before 
Easter. Should you take that kind of timing into account? 


Component 6: Differences in the groups being compared, in addition to the fac- 
tor of interest The groups being compared are the groups of items from the three 
stores. There should be no additional differences related to the direct costs of the 
items. However, if you were conducting the study in order to minimize your shop- 
ping costs, you might ask if there are hidden costs for shopping at one store versus 
another. For example, do you always have to wait in line at one store and not at an- 
other, and should you therefore put a value on your time? Does one store make mis- 
takes at the cash register more often than another? Does one store charge a higher 
fee to use your cash card for payment? Does it cost more to drive to one store than 
another? 


Component 7: The extent or size of any claimed effects or differences This 
component should enter into your decision about where to shop after you have fin- 
ished the study. Even if you find that items in one store cost less than in another, the 
amount of the difference may not convince you to shop there. You would probably 
want to figure out approximately how much shopping in a particular store would 
save you over the course of a year. You can see why knowing the amount of a dif- 
ference found in a study is an important component for using that study to make fu- 
ture decisions. 
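
The yearly-savings estimate described above can be sketched in a few lines. The basket totals and the one-trip-per-week figure below are hypothetical placeholders, not data from any study; substitute the prices you actually record.

```python
# Hypothetical inputs: none of these figures come from the text.
weekly_basket = {          # total cost of the same item list at each store
    "Store A": 84.10,
    "Store B": 80.85,
    "Store C": 86.40,
}
trips_per_year = 52        # assumes one shopping trip per week

# Store with the lowest total for the recorded basket.
cheapest = min(weekly_basket, key=weekly_basket.get)

# Extra cost per year of shopping at each store instead of the cheapest one.
annual_savings = {
    store: (cost - weekly_basket[cheapest]) * trips_per_year
    for store, cost in weekly_basket.items()
}

for store, saved in annual_savings.items():
    if store != cheapest:
        print(f"Shopping at {cheapest} instead of {store} saves about ${saved:.2f} per year")
```

Whether a difference of this size is worth acting on is the judgment Component 7 asks you to make; the hidden costs listed under Component 6 could be added as further terms.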


Brooks Shoes Brings Flawed Study to Court 
Source: Gastwirth (1988) pp. 517-520. 


In 1981, Brooks Shoe Manufacturing Company sued Suave Shoe Corporation for 
manufacturing shoes incorporating a “V” design used in Brooks’s athletic shoes. 
Brooks claimed that the design was an unregistered trademark that people used to 
identify Brooks shoes. According to Gastwirth (1988, p. 517), it was the role of the 
court to determine “the distinctiveness or strength of the mark as well as its possible 
secondary meaning (similarity of product or mark might confuse prospective pur- 
chasers of the source of the item).” 

To show that the design had “secondary meaning” to buyers, Brooks conducted 
a survey of 121 spectators and participants at three track meets. Interviewers ap- 
proached people and asked them a series of questions that included showing them a 
Brooks shoe with the name masked and asking them to identify it. Of those sur- 
veyed, 71% were able to identify it as a Brooks shoe, and 33% of those people said 
it was because they recognized the “V.” When shown a Suave shoe, 39% of them
thought it was a Brooks shoe, with 48% of those people saying it was because of the 
“V” design on the Suave shoe. Brooks Company argued that this was sufficient evi- 
dence that people might be confused and think Suave shoes were manufactured by 
Brooks. 
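
Because the “V” percentages are percentages of a subgroup, it can help to convert them to shares of the whole sample. The sketch below assumes that “33% of those people” refers to the 71% who named Brooks, and that the 48% likewise refers to the 39% who mistook the Suave shoe; the head counts are approximate because the reported percentages are rounded.

```python
n_respondents = 121  # survey size reported in the case study

def share_of_sample(p_named_brooks, p_cited_v):
    """Share of all respondents who named Brooks AND cited the "V" as the reason."""
    return p_named_brooks * p_cited_v

brooks_shoe = share_of_sample(0.71, 0.33)  # shown the masked Brooks shoe
suave_shoe = share_of_sample(0.39, 0.48)   # shown the Suave shoe

for label, share in [("Brooks shoe", brooks_shoe), ("Suave shoe", suave_shoe)]:
    print(f"{label}: {share:.1%} of the sample, roughly {share * n_respondents:.0f} people")
# Brooks shoe: 23.4% of the sample, roughly 28 people
# Suave shoe: 18.7% of the sample, roughly 23 people
```

Under these assumptions, fewer than one in four respondents tied either shoe to Brooks because of the “V,” even in a sample drawn from track meets.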

Suave had a statistician as an expert witness who pointed out a number of flaws 
in the Brooks survey. Let’s examine them using the Seven Critical Components as a 
guide. First, the survey was funded and conducted by Brooks, and the company’s 
lawyer was instrumental in designing it. Second, the court determined that the inter- 
viewers who had contact with the respondents were inadequately trained in how to 
conduct an unbiased survey. Third, the individuals asked were not selected to be rep- 
resentative of the general public in the area (Baltimore/Washington, D.C.). For ex- 
ample, 78% had some college education, compared with 18.4% in Baltimore and 
37.7% in Washington, D.C. Further, the settings for the interviews were track meets, 
where people were likely to be more familiar with athletic shoes. The questions 
asked were biased. For example, the exact wording used when a person was handed 
the shoes was: “I am going to hand you a shoe. Please tell me what brand you think 
it is.” The way the question is framed would presumably lead respondents to think
the shoe has a well-known brand name. Later in the questioning, respondents were 
asked, “How long have you known about Brooks Running Shoes?” Because of the 
setting, respondents could have informed others at the track meet that Brooks was 
probably conducting the survey, and those informed could have subsequently been 
interviewed. 

Suave introduced its own survey conducted on 404 respondents properly sam- 
pled from the population of all people who had purchased any type of athletic shoe 
during the previous year. Of those, only 2.7% recognized a Brooks shoe on the ba- 
sis of the “V” design. The combination of the poor survey methods by Brooks and 
the proper survey by Suave convinced the court that the public did not make enough 
of an association between Brooks and the “V” design to allow Brooks to claim legal 
rights to the design. 


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. Suppose that a television network wants to know how daytime television view- 
ers feel about a new soap opera the network is broadcasting. A staff member 
suggests that just after the show ends they give two phone numbers, one for 
viewers to call if they like the show and the other to call if they don’t. Give two 
reasons why this method would not produce the desired information. (Hint: The 
network is interested in all daytime television viewers. Who is likely to be 
watching just after the show, and who is likely to call in?) 


2. The April 24, 1997, issue of “UCDavis Lifestyle Newstips” reported that a pro- 
fessor of veterinary medicine was conducting a study to see if a drug called 
clomipramine, an anti-anxiety medication used for humans, could reduce “canine
aggression toward family members.” The newsletter said, “Dogs demonstrating
this type of aggression are needed to participate in the study. . . . Half of
the participating dogs will receive clomipramine, while the others will be given
a placebo.” A phone number was given for dog owners to call to volunteer their
dogs for the study. To what group could the results of this study be applied?
Explain.


*3. A prison administration wants to know whether the prisoners think the guards
treat them fairly. Explain how each of the following components could be used 
to produce biased results, versus how each could be used to produce unbiased 
results: 


*a. Component 2: The researchers who had contact with the participants. 


b. Component 4: The exact nature of the measurements made or questions 
asked. 


4. According to Cynthia Crossen (1994, p. 106): “It is a poller’s business to press
for an opinion whether people have one or not. ‘Don’t knows’ are worthless to 
pollers, whose product is opinion, not ignorance. That’s why so many polls do 
not even offer a ‘don’t know’ alternative.” 


a. Explain how this problem might lead to bias in a survey. 
b. Which of the Seven Critical Components would bring this problem to light? 


*5. The student who conducted the study in “Hypothetical News Article 1” in this
chapter collected two pieces of data from each participant. What were the two 
pieces of data? 


6. Many research organizations give their interviewers an exact script to follow
when conducting interviews to measure opinions on controversial issues. Why 
do you think they do so? 


*7. Is it necessary that “data” consist of numbers? Explain.


8. Refer to Case Study 1.1, “Heart or Hypothalamus?” Discuss each of the
following components, including whether you think the way it was handled would
detract from Salk’s conclusion: 


a. Component 3 
b. Component 4 
c. Component 5 
d. Component 6 


9. Suppose a tobacco company is planning to fund a telephone survey of attitudes
about banning smoking in restaurants. In each of the following phases of the 
survey, should the company disclose who is funding the study? Explain your an- 
swer in each case. 


a. When respondents answer the phone, before they are interviewed. 
b. When the survey results are reported in the news. 
c. When the interviewers are trained and told how to conduct the interviews. 


*10. Suppose a study were to find that twice as many users of nicotine patches quit
smoking as nonusers. Suppose you are a smoker trying to quit. Which version
of an answer to each of the following components would be more compelling
evidence for you to try the nicotine patches? Explain.


*a. Component 3. Version 1 is that the nicotine patch users were lung cancer pa- 
tients, whereas the nonusers were healthy. Version 2 is that participants were 
randomly assigned to use the patch or not after answering an advertisement 
in the newspaper asking for volunteers who wanted to quit smoking. 


*b. Component 7. Version 1 is that 25% of nonusers quit, whereas 50% of users 
quit. Version 2 is that 1% of nonusers quit, whereas 2% of users quit. 


11. In most studies involving human participants, researchers are required to fully
disclose the purpose of the study to the participants. Do you think people should 
always be informed about the purpose before they participate? Explain. 


12. Explain why news reports should give the extent or size of the claimed effects
or differences from a study instead of just reporting that an effect or difference 
was found. 


13. Suppose a study were to find that drinking coffee raised cholesterol levels.
Further, suppose you drink two cups of coffee a day and have a family history of
heart problems related to high cholesterol. Pick three of the Seven Critical Com- 
ponents and discuss why knowledge of them would be useful in terms of decid- 
ing whether to change your coffee-drinking habits based on the results of the 
study. 


14. Holden (1991, p. 934) discusses the methods used to rank high school math
performance among various countries. She notes that “According to the
International Association for the Evaluation of Educational Achievement, Hungary
ranks near the top in 8th-grade math achievement. But by the 12th grade, the 
country falls to the bottom of the list because it enrolls more students than any 
other country—50%—in advanced math. Hong Kong, in contrast, comes in 
first, but only 3% of its 12th graders take math.” Explain why answers to Com- 
ponents 3 and 6 would be most useful when interpreting the results of rankings 
of high school math performance in various countries, and describe how your 
interpretation of the results would be affected by knowing the answers. 


*15. Moore (1991, p. 19) reports the following contradictory evidence: “The advice
columnist Ann Landers once asked her readers, ‘If you had it to do over again,
would you have children?’ She received nearly 10,000 responses, almost 70%
saying ‘No!’ . . . A professional nationwide random sample commissioned by
Newsday ... polled 1373 parents and found that 91% would have children 
again.” Using the most relevant one of the Seven Critical Components, explain 
the contradiction in the two sets of answers. 

16. An advertisement for a cross-country ski machine, NordicTrack, claimed, “In
just 12 weeks, research shows that people who used a NordicTrack lost an av- 
erage of 18 pounds.” Explain how each of the following components should 
have been addressed if the research results are fair and unbiased. 

a. Component 3: The individuals or objects studied and how they were selected. 


b. Component 4: The exact nature of the measurements made or questions 
asked. 
c. Component 5: The setting in which the measurements were taken.


d. Component 6: Differences in the groups being compared, in addition to the 
factor of interest. 


Mini-Projects 


1. Scientists publish their findings in technical magazines called journals. Most
university libraries have hundreds of journals available for browsing, many of
them accessible electronically. Find out where the medical journals are located. 
Browse the shelves or electronic journals until you find an article with a study 
that sounds interesting to you. (The New England Journal of Medicine and the 
Journal of the American Medical Association often have articles of broad inter- 
est, but there are also numerous specialized journals on pediatrics, cancer, 
AIDS, and so on.) Read the article and write a report that discusses each of the 
Seven Critical Components for that particular study. Argue for or against the be- 
lievability of the results on the basis of your discussion. Be sure you find an ar- 
ticle discussing a single study and not a collection or “meta-analysis” of 
numerous studies. 


2. Explain how you would design and carry out a study to find out how students at
your school feel about an issue of interest to you. Be explicit enough that
someone would actually be able to follow your instructions and implement the study.
Be sure to consider each of the Seven Critical Components when you design and 
explain how to do the study. 


3. Find an example of a statistical study reported in the news for which
information about one of the Seven Critical Components is missing. Write two
hypothetical reports addressing the missing component that would lead you to two
different conclusions about the applicability of the results of the study. 
Measurements, Mistakes,
and Misunderstandings


Thought Questions 


1. Suppose you were interested in finding out what people felt to be the most
important problem facing society today. Do you think it would be better to give them a
fixed set of choices from which they must choose or an open-ended question that al- 
lowed them to specify whatever they wished? What would be the advantages and 
disadvantages of each approach? 


2. You and a friend are each doing a survey to see if there is a relationship between
height and happiness. Without discussing in advance how you will do so, you both
attempt to measure the height and happiness of the same 100 people. Are you more 
likely to agree on your measurement of height or on your measurement of happi- 
ness? Explain, discussing how you would measure each characteristic. 


3. A newsletter distributed by a politician to his constituents gave the results of a
“nationwide survey on Americans’ attitudes about a variety of educational issues.” One
of the questions asked was, “Should your legislature adopt a policy to assist children 
in failing schools to opt out of that school and attend an alternative school—public, 
private, or parochial—of the parents’ choosing?” From the wording of this question, 
can you speculate on what answer was desired? Explain. 


4. You are at a swimming pool with a friend and become curious about the width of
the pool. Your friend has a 12-inch ruler, with which he sets about measuring the
width. He reports that the width is 15.771 feet. Do you believe the pool is exactly 
that width? What is the problem? (Note that .771 feet is 9 1/4 inches.) 


5. If you were to have your intelligence, or IQ, measured twice using a standard IQ test,
do you think it would be exactly the same both times? What factors might account
for any changes? 
3.1 Simple Measures Don’t Exist


In the last chapter, we listed Seven Critical Components that need to be considered 
when someone conducts a study. You saw that many decisions need to be made and 
many potential problems can arise when you try to use data to answer a question. 
One of the hardest decisions is contained in Component 4—that is, in deciding ex- 
actly what to measure or what questions to ask. In this chapter, we focus on prob- 
lems with defining measurements and on the subsequent misunderstandings and 
mistakes that can result. When you read the results of a study, it is important that you 
understand exactly how the information was collected and what was measured or 
asked. Consider something as apparently simple as trying to measure your own 
height. Try it a few times and see if you get the measurement to within a quarter of 
an inch from one time to the next. Now imagine trying to measure something much 
more complex, such as the amount of fat in someone’s diet or the degree of happi- 
ness in someone’s life. Researchers routinely attempt to measure these kinds of 
factors. 


3.2 It’s All in the Wording 


You may be surprised at how much answers to questions can change based on sim- 
ple changes in wording. Here are two examples. 


EXAMPLE 1 How Fast Were They Going? 


Loftus and Palmer (1974; quoted in Plous, 1993, p. 32) showed college students films 
of an automobile accident, after which they asked them a series of questions. One 
group was asked the question: “About how fast were the cars going when they con- 
tacted each other?” The average response was 31.8 miles per hour. Another group was 
asked: “About how fast were the cars going when they collided with each other?” In 
that group, the average response was 40.8 miles per hour. Simply changing from the 
word contacted to the word collided increased the estimates of speed by 9 miles per 
hour, or 28%, even though the respondents had witnessed the same film.
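
The 9 miles per hour and 28% figures follow directly from the two reported averages. A minimal check, using only the numbers quoted above:

```python
contacted_mph = 31.8  # average estimate when the question said "contacted"
collided_mph = 40.8   # average estimate when the question said "collided"

difference = collided_mph - contacted_mph
relative_increase = difference / contacted_mph  # measured against the "contacted" group

print(f"Difference: {difference:.1f} mph")            # Difference: 9.0 mph
print(f"Relative increase: {relative_increase:.1%}")  # Relative increase: 28.3%
```

Note that the percentage depends on which group serves as the baseline; dividing by 40.8 instead would give a smaller figure.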


EXAMPLE 2 Is Marijuana Easy to Buy but Hard to Get? 
Refer to the detailed report on the CD labeled as Original Source 13: “2003 CASA
National Survey of American Attitudes on Substance Abuse VIII: Teens and Parents,” which
describes a survey of teens and drug use. One of the questions (number 36, p. 44) asked
teens about the relative ease of getting cigarettes, beer, and marijuana. About half of 
the teens were asked about “buying” these items and the other half about “obtaining” 
them. The questions and percent giving each response were: 


“Which is easiest for someone of your age to buy: cigarettes, beer or marijuana?”
“Which is easiest for someone of your age to obtain: cigarettes, beer or marijuana?”

Response                  Version with “buy”    Version with “obtain”
Cigarettes                35%                   39%
Beer                      18%                   27%
Marijuana                 34%                   19%
The Same                  4%                    5%
Don’t know/no response    9%                    10%


Notice that the responses indicate that beer is easier to “obtain” than is marijuana, but 
marijuana is easier to “buy” than beer. The subtle difference in wording reflects a very 
important difference in real life. Regulations and oversight authorities have made it
difficult for teenagers to buy alcohol, but not to obtain it in other ways.
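
The size of the wording effect is easier to see as a shift in percentage points for each response. The sketch below simply re-enters the reported table; the variable names are illustrative.

```python
# (percent with "buy" wording, percent with "obtain" wording), as reported above
responses = {
    "Cigarettes": (35, 39),
    "Beer": (18, 27),
    "Marijuana": (34, 19),
    "The same": (4, 5),
    "Don't know/no response": (9, 10),
}

# Each version should account for all respondents who were asked it.
buy_total = sum(buy for buy, _ in responses.values())
obtain_total = sum(obtain for _, obtain in responses.values())

# Shift in percentage points when "buy" is replaced by "obtain".
shift = {item: obtain - buy for item, (buy, obtain) in responses.items()}

for item, points in shift.items():
    print(f"{item}: {points:+d} percentage points")
```

Beer gains 9 points and marijuana loses 15, which is the reversal described in the text; the other responses move by at most 4 points.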


Many pitfalls can be encountered when asking questions in a survey or 
experiment. Here are some of them; each will be discussed in turn: 

1. Deliberate bias
2. Unintentional bias
3. Desire to please
4. Asking the uninformed
5. Unnecessary complexity
6. Ordering of questions
7. Confidentiality and anonymity


Deliberate Bias 


Sometimes, if a survey is being conducted to support a certain cause, questions are 
deliberately worded in a biased manner. Be careful about survey questions that begin
with phrases like “Do you agree that . . . ?” Most people want to be agreeable and
will be inclined to answer yes unless they have strong feelings the other way. For ex- 
ample, suppose an anti-abortion group and a pro-choice group each wanted to con- 
duct a survey in which they would find the best possible agreement with their 
position. Here are two questions that would each produce an estimate of the propor- 
tion of people who think abortion should be completely illegal. Each question is al- 
most certain to produce a different estimate: 


1. Do you agree that abortion, the murder of innocent beings, should be outlawed?
2. Do you agree that there are circumstances under which abortion should be legal,
   to protect the rights of the mother?


Appropriate wording should not indicate a desired answer. For instance, a Gallup 
Poll conducted in June 1998 contained the question “Do you think it was a good 
thing or a bad thing that the atomic bomb was developed?” Notice that the question 
does not indicate which answer is preferable. In case you’re curious, 61% of the
respondents said “bad,” whereas 36% said “good” and 3% were undecided.


Unintentional Bias 


Sometimes questions are worded in such a way that the meaning is misinterpreted 
by a large percentage of the respondents. For example, if you were to ask people 
whether they use drugs, you would need to specify if you mean prescription drugs, 
illegal drugs, over-the-counter drugs, or common substances such as caffeine. If you 
were to ask people to recall the most important date in their life, you would need to 
clarify if you meant the most important calendar date or the most important social 
engagement with a potential partner. (It is unlikely that anyone would mistake the 
question as being about the shriveled fruit, but you can see that the same word can 
have multiple meanings.) 


Desire to Please 


Most survey respondents have a desire to please the person who is asking the ques- 
tion. They tend to understate their responses about undesirable social habits and 
opinions, and vice versa. For example, in recent years estimates of the prevalence of 
cigarette smoking based on surveys do not match those based on cigarette sales. Ei- 
ther people are not being completely truthful or lots of cigarettes are ending up in 
the garbage. 


Asking the Uninformed 


People do not like to admit that they don’t know what you are talking about when 
you ask them a question. Crossen (1994, p. 24) gives an example: “When the Amer- 
ican Jewish Committee studied Americans’ attitudes toward various ethnic groups, 
almost 30% of the respondents had an opinion about the fictional Wisians, rating 
them in social standing above a half-dozen other real groups, including Mexicans, 
Vietnamese, and African blacks.” Political pollsters, who are interested in surveying 
only those who will actually vote, learned long ago that it is useless to simply ask 
people if they plan to vote. Most of them will say yes. Instead, they ask questions to 
establish a history of voting, such as “Where did you go to vote in the last election?” 


Unnecessary Complexity 


If questions are to be understood, they must be kept simple. A question such as 
“Shouldn’t former drug dealers not be allowed to work in hospitals after they are re- 
leased from prison?” is sure to lead to confusion. Does a yes answer mean they 
should or they shouldn’t be allowed to work in hospitals? It would take a few read- 
ings to figure that out. 

Another way in which a question can be unnecessarily complex is to actually ask 
more than one question at once. An example would be a question such as “Do you 
support the president’s health care plan because it would ensure that all Americans 
receive health coverage?” If you agree with the idea that all Americans should
receive health coverage, but disagree with the remainder of the plan, do you answer
yes or no? Or what if you support the president’s plan, but not for that reason? 


Ordering of Questions 


If one question requires respondents to think about something that they may not have 
otherwise considered, then the order in which questions are presented can change the 
results. For example, suppose a survey were to ask, “To what extent do you think 
teenagers today worry about peer pressure related to drinking alcohol?” and then 
ask, “Name the top five pressures you think face teenagers today.” It is quite likely 
that respondents would use the idea they had just been given and name peer pressure 
related to drinking alcohol as one of the five choices. 


Confidentiality and Anonymity 


People sometimes answer questions differently based on the degree to which they 
believe they are anonymous. Because researchers often need to perform follow-up 
surveys, it is easier to try to ensure confidentiality than true anonymity. In ensuring 
confidentiality, the researcher promises not to release identifying information about 
respondents. In a truly anonymous survey, the researcher does not know the identity 
of the respondents. 

Questions on issues such as sexual behavior and income are particularly difficult 
because people consider those to be private matters. A variety of techniques have 
been developed to help ensure confidentiality, but surveys on such issues are hard to 
conduct accurately. 


No Opinion of Your Own? Let Politics Decide 
Source: Morin (10-16 April 1995), p. 36. 


This is an excellent example of how people will respond to survey questions, even 
when they do not know about the issues, and how the wording of questions can in- 
fluence responses. In 1995, the Washington Post decided to expand on a 1978 poll 
taken in Cincinnati, Ohio, in which people were asked whether they “favored or op- 
posed repealing the 1975 Public Affairs Act.” There was no such act, but about one- 
third of the respondents expressed an opinion about it. 

In February 1995, the Washington Post added this fictitious question to its 
weekly poll of 1000 randomly selected respondents: “Some people say the 1975 
Public Affairs Act should be repealed. Do you agree or disagree that it should be re- 
pealed?” Almost half (43%) of the sample expressed an opinion, with 24% agreeing 
that it should be repealed and 19% disagreeing. The Post then tried another trick that 
produced even more disturbing results. This time, they polled two separate groups of 
500 randomly selected adults. The first group was asked: “President Clinton [a De- 
mocrat] said that the 1975 Public Affairs Act should be repealed. Do you agree or 
disagree?” The second group was asked: “The Republicans in Congress said that the 
1975 Public Affairs Act should be repealed. Do you agree or disagree?” Respondents
were also asked about their party affiliation. Overall, 53% of the respondents ex- 
pressed an opinion about repealing this fictional act. The results by party affiliation 
were striking: For the “Clinton” version, 36% of the Democrats but only 16% of the 
Republicans agreed that the act should be repealed. For the “Republicans in Con- 
gress” version, 36% of the Republicans but only 19% of the Democrats agreed that 
the act should be repealed. 
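
One way to summarize the party-cue effect is the gap, in percentage points, between Democrats and Republicans who agreed with repeal under each wording. This is only an illustrative tabulation of the four reported figures, not an analysis performed in the original poll.

```python
# Percent agreeing that the fictional act should be repealed, by question version.
agree = {
    "Clinton": {"Democrats": 36, "Republicans": 16},
    "Republicans in Congress": {"Democrats": 19, "Republicans": 36},
}

# Positive gap: Democrats agreed more often; negative gap: Republicans did.
gaps = {
    version: by_party["Democrats"] - by_party["Republicans"]
    for version, by_party in agree.items()
}

for version, gap in gaps.items():
    print(f'"{version}" version: Democrats minus Republicans = {gap:+d} points')
```

The gap changes sign when the attributed source changes, even though the act itself does not exist.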


3.3 Open or Closed Questions: Should Choices Be Given? 


An open question is one in which respondents are allowed to answer in their own 
words, whereas a closed question is one in which they are given a list of alternatives 
from which to choose their answer. Usually the latter form offers a choice of “other,” 
in which the respondent is allowed to fill in the blank. 


Problems with Closed Questions 


To show the limitation of closed questions, Schuman and Scott (22 May 1987) asked 
about “the most important problem facing this country today.” Half of the sample, 
171 people, were given this as an open question. The most common responses were: 


Unemployment (17%)
General economic problems (17%)
Threat of nuclear war (12%)
Foreign affairs (10%)


In other words, one of these four choices was volunteered by over half of the re- 
spondents. The other half of the sample was given this as a closed question. Follow- 
ing is the list of choices and the percentage of respondents who chose them: 


The energy shortage (5.6%)
The quality of public schools (32.0%)
Legalized abortion (8.4%)
Pollution (14.0%)


These four choices combined were mentioned by only 2.4% of respondents in the 
open-question survey; yet they were selected by 60% when they were the only spe- 
cific choices given. Further, respondents in this closed-question survey were given 
an open choice. In addition to the list of four, they were told: “If you prefer, you may 
name a different problem as most important.” On the basis of the closed-form ques- 
tionnaire, policymakers would have been seriously misled about what is important 
to the public. 
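
The “over half” and “60%” figures are sums of the listed percentages. The sketch below re-adds them and compares the closed-question total with the 2.4% those same four items drew in the open version; the variable names are illustrative.

```python
# Top volunteered answers in the open version (percent of respondents)
open_top_four = {
    "Unemployment": 17,
    "General economic problems": 17,
    "Threat of nuclear war": 12,
    "Foreign affairs": 10,
}

# Listed choices in the closed version (percent of respondents)
closed_choices = {
    "The energy shortage": 5.6,
    "The quality of public schools": 32.0,
    "Legalized abortion": 8.4,
    "Pollution": 14.0,
}

open_total = sum(open_top_four.values())               # share naming a top-four answer
closed_total = round(sum(closed_choices.values()), 1)  # share picking a listed choice
same_items_in_open = 2.4                               # those four choices in the open version

ratio = closed_total / same_items_in_open

print(f"Open version, top four answers: {open_total}%")
print(f"Closed version, four listed choices: {closed_total}%")
print(f"Listed choices were picked about {ratio:.0f} times as often when offered")
```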

It is possible to avoid this kind of astounding discrepancy. If closed questions are 
preferred, they first should be presented as open questions to a test sample before the 
real survey is conducted. Then the most common responses should be included in
the list of choices for the closed question. This kind of exercise is usually done as 
part of what’s called a “pilot survey,” in which various aspects of a study design can 
be tried before it’s too late to change them. 


Problems with Open Questions 


The biggest problem with open questions is that the results can be difficult to sum- 
marize. If a survey includes thousands of respondents, it can be a major chore to cat- 
egorize their responses. Another problem, found by Schuman and Scott (22 May 
1987), is that the wording of the question might unintentionally exclude answers that 
would have been appealing had they been included in a list of choices (such as in a 
closed question). To test this, they asked 347 people to “name one or two of the most 
important national or world event(s) or change(s) during the past 50 years.” The most 
common choices and the percentage who mentioned them were: 


World War II (14.1%)
Exploration of space (6.9%)
Assassination of John F. Kennedy (4.6%)
The Vietnam War (10.1%)
Don’t know (10.6%)
All other responses (53.7%)

The same question was then repeated in closed form to a new group of 354 people.
Five choices were given: the first four choices in the preceding list plus “invention 
of the computer.” Of the 354 respondents, the percentage of those who selected each 
choice was: 

World War II (22.9%)
Exploration of space (15.8%)
Assassination of John F. Kennedy (11.6%)
The Vietnam War (14.1%)
Invention of the computer (29.9%)
Don’t know (0.3%)
All other responses (5.4%)

The most frequent response was “invention of the computer,” which had been
mentioned by only 1.4% of respondents in the open question. Clearly, the wording of the
question led respondents to focus on “events” rather than “changes,” and the inven- 
tion of the computer did not readily come to mind. When it was presented as an op- 
tion, however, people realized that it was indeed one of the most important events or 
changes during the past 50 years. In summary, there are advantages and disadvan- 
tages to both approaches. One compromise is to ask a small test sample to list the 
first several answers that come to mind, and then use the most common of those in
a closed-question survey. These choices could be supplemented with additional
answers such as “invention of the computer,” which may not readily come to mind.

Remember that, as the reader, you have an important role in interpreting the
results. You should always be informed as to whether questions were asked in open or
closed form, and if the latter, you should be told what the choices were. You should 
also be told whether “don’t know” or “no opinion” was offered as a choice in either 
case. 
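
To see how much the form of the question moved each answer in the Schuman and Scott events survey, the two sets of percentages reported above can be placed side by side. This is a rough sketch: the 1.4% open-form figure for the computer comes from the text rather than the open-form list, and the head count is approximate because the percentages are rounded.

```python
open_form = {   # percent of the 347 open-question respondents
    "World War II": 14.1,
    "Exploration of space": 6.9,
    "Assassination of John F. Kennedy": 4.6,
    "The Vietnam War": 10.1,
    "Don't know": 10.6,
    "All other responses": 53.7,
}
closed_form = {  # percent of the 354 closed-question respondents
    "World War II": 22.9,
    "Exploration of space": 15.8,
    "Assassination of John F. Kennedy": 11.6,
    "The Vietnam War": 14.1,
    "Invention of the computer": 29.9,
    "Don't know": 0.3,
    "All other responses": 5.4,
}
computer_open = 1.4  # reported in the text; not listed separately in the open-form results
n_closed = 354

# Change in percentage points for every answer that appears in both lists.
change = {
    answer: round(closed_form[answer] - open_form[answer], 1)
    for answer in open_form
}
computer_change = round(closed_form["Invention of the computer"] - computer_open, 1)
computer_count = round(closed_form["Invention of the computer"] / 100 * n_closed)

for answer, points in change.items():
    print(f"{answer}: {points:+.1f} points")
print(f"Invention of the computer: {computer_change:+.1f} points "
      f"(about {computer_count} of {n_closed} respondents in the closed form)")
```

Offering a list pulled responses away from “don’t know” and “all other” and toward every listed item, with the largest gain going to the one item respondents had rarely volunteered.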


3.4 Defining What Is Being Measured 


EXAMPLE 3 Teenage Sex


To understand the results of a survey or an experiment, we need to know exactly what
was measured. Consider this example. A letter to advice columnist Ann Landers stated:
“According to a report from the University of California at San Francisco . . . sexual
activity among adolescents is on the rise. There is no indication that this trend is slowing
down or reversing itself.” The letter went on to explain that these results were based on
a national survey (Davis (CA) Enterprise, 19 February 1990, p. B-4). On the same day, in
the same newspaper, an article entitled “Survey: Americans conservative with sex”
reported that “teenage boys are not living up to their reputations. [A study by the Urban
Institute in Washington] found that adolescents seem to be having sex less often, with
fewer girls and at a later age than teenagers did a decade ago” (p. A-9).

Here we have two apparently conflicting reports on adolescent sexuality, both
reported on the same day in the same newspaper. One indicated that teenage sex was on
the rise; the other indicated that it was on the decline. Although neither report
specified exactly what was measured, the letter to Ann Landers proceeded to note that
“national statistics show the average age of first intercourse is 17.2 for females and 16.5 for
males.” The article stating that adolescent sex was on the decline measured it in terms
of frequency. The result was based on interviews with 1880 boys between the ages of
15 and 19, in which “the boys said they had had six sex partners, compared with seven
a decade earlier. They reported having had sex an average of three times during the
previous month, compared with almost five times in the earlier survey.” Thus, it is not
enough to note that both surveys were measuring adolescent or teenage sexual
behavior. In one case, the author was, at least partially, discussing the age of first intercourse,
whereas in the other case the author was discussing the frequency.


EXAMPLE 4 The Unemployed


Ask people whether they know anyone who is unemployed; they will invariably say yes. 
But most people don't realize that in order to be officially unemployed, and included in 
the unemployment statistics given by the U.S. government, you must meet very stringent
criteria. The Bureau of Labor Statistics uses this definition when computing the official
United States unemployment rate (http://www.bls.gov/cps/cps_faq.htm#Ques5;
accessed Oct 6, 2003):


Persons are classified as unemployed if they do not have a job, have actively 
looked for work in the prior 4 weeks, and are currently available for work. 


To find the unemployment rate, the number of people who meet this definition is di- 
vided by the total number of people “in the labor force,” which includes these individ- 
uals and people classified as employed. But “discouraged workers” are not included at 
all. “Discouraged workers” are defined as:


Persons not in the labor force who want and are available for a job and who have
looked for work sometime in the past 12 months (or since the end of their last job
if they held one within the past 12 months), but who are not currently looking
because they believe there are no jobs available or there are none for which they
would qualify. (http://www.bls.gov/bls/glossary.htm; accessed Oct 6, 2003)


If you know someone who fits that definition, you would undoubtedly think of that 
person as unemployed even though they hadn't looked for work in the past 4 weeks. 
However, he or she would not be included in the official statistics. You can see that 
the true number of people who are not working is higher than government statistics 
indicate.
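
The calculation described in this example can be sketched in a few lines of Python. The counts below are hypothetical and chosen only for illustration; they are not Bureau of Labor Statistics figures. The second figure, which adds discouraged workers to both the unemployed count and the labor force, is an unofficial alternative shown only to illustrate how the definition changes the result.

```python
def unemployment_rate(unemployed, employed):
    """Unemployed divided by the labor force (unemployed + employed), as a percentage."""
    labor_force = unemployed + employed
    return 100 * unemployed / labor_force


# Hypothetical counts, in thousands (assumed values, not real data).
employed = 9_400
unemployed = 600      # no job, looked for work in the prior 4 weeks, available now
discouraged = 50      # want a job but are not currently looking

# Rate under the official definition: discouraged workers are left out entirely.
official = unemployment_rate(unemployed, employed)

# Unofficial alternative: treat discouraged workers as unemployed labor-force members.
broader = unemployment_rate(unemployed + discouraged, employed)

print(f"Official definition: {official:.1f}%")
print(f"Counting discouraged workers: {broader:.1f}%")
```

With these made-up counts, the two definitions give noticeably different percentages for the same group of people, which is why a reader needs to know which definition was used.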


These two examples illustrate that when you read about measurements taken by 
someone else, you should not automatically assume you are speaking a common lan- 
guage. A precise definition of what is meant by “adolescent sexuality” or “unem- 
ployment” should be provided. 


Some Concepts Are Hard to Define Precisely 


Sometimes it is not the language but the concept itself that is ill-defined. For exam- 
ple, there is still no universal agreement on what should be measured with intelli- 
gence, or IQ, tests. The tests were originated at the beginning of the 20th century in 
order to determine the mental level of school children. The intelligence quotient (IQ) 
of a child was found by dividing the child’s “mental level” by his or her chronolog- 
ical age. The “mental level” was determined by comparing the child’s performance 
on the test with that of a large group of “normal” children, to find the age group the 
individual’s performance matched. Thus, if an 8-year-old child performed as well on 
the test as a “normal” group of 10-year-old children, he or she would have an IQ of 
100 × (10/8) = 125.
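
As a small sketch of the early ratio definition just described, the function below (the name `ratio_iq` is ours, introduced only for illustration) divides the mental level by chronological age and multiplies by 100.

```python
def ratio_iq(mental_level, chronological_age):
    """Early 'ratio' IQ: 100 times mental level divided by chronological age."""
    return 100 * mental_level / chronological_age


# The 8-year-old who performs like a typical group of 10-year-olds.
print(ratio_iq(10, 8))

# A child whose mental level matches his or her age scores exactly 100.
print(ratio_iq(8, 8))
```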

IQ tests have been expanded and refined since the early days, but they continue 
to be surrounded by controversy. One reason is that it is very difficult to define what 
is meant by intelligence. It is difficult to measure something if you can’t even agree 
on what it is you are trying to measure. If you are interested in knowing more about 
these tests and the surrounding controversies, you can find numerous books on the 
subject. Anastasi and Urbina (1997) provide a detailed discussion of a large variety 
of psychological tests, including IQ tests. 


EXAMPLE 5 Stress in Kids 


The studies reported in News Stories 13 and 15 both included “stress” as one of the
important measurements used. But they differed in how they measured stress. In Original
Source 13, “2003 CASA National Survey of American Attitudes on Substance Abuse VIII:
Teens and Parents,” teenage respondents were asked:


How much stress is there in your life? Think of a scale between 0 and 10, where 0 
means you usually have no stress at all and 10 means you usually have a very 
great deal of stress, which number would you pick to indicate how much stress 
there is in your life? (p. 40) 


Categorizing responses as low stress (0 to 3), moderate stress (4 to 6), and high stress
(7 to 10), the researchers found that low, moderate, and high stress were reported by
29%, 45%, and 26% of teens, respectively.
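
The grouping step can be sketched as follows. The cutoffs are the ones given above; the list of ratings is hypothetical and is not the CASA data.

```python
from collections import Counter


def stress_category(rating):
    """Map a 0-to-10 self-rating to the three bands described in the text."""
    if not 0 <= rating <= 10:
        raise ValueError("rating must be between 0 and 10")
    if rating <= 3:
        return "low"
    if rating <= 6:
        return "moderate"
    return "high"


# Hypothetical ratings from ten respondents (assumed values for illustration).
ratings = [0, 2, 3, 4, 5, 5, 6, 7, 8, 10]

counts = Counter(stress_category(r) for r in ratings)
for band in ("low", "moderate", "high"):
    share = 100 * counts[band] / len(ratings)
    print(f"{band}: {share:.0f}%")
```

Moving a cutoff by a single point would shift respondents between bands, so the reported percentages depend on the categories as well as on the answers.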

For News Story 15, the children were asked more specific questions to measure 
stress. According to Additional News Story 15, “To gauge their stress, the children were 
given a standard questionnaire that included questions like: ‘How often have you felt 
that you couldn't control the important things in your life?’”

There is no way to know which method is more likely to produce an accurate mea- 
sure of “stress,” partly because there is no fixed definition of stress. Stress in one sce- 
nario might mean that someone is working hard to finish an exciting project with a tight 
deadline. In another scenario it might mean that someone feels helpless and out of con- 
trol. Those two versions are likely to have very different consequences on someone's 
health and well-being. What is important is that as a reader, you are informed about 
how the researchers measured stress in any given study.


Measuring Attitudes and Emotions 


Similar problems exist with trying to measure attitudes and emotions such as self- 
esteem and happiness. The most common method for trying to measure such things 
is to have respondents read statements and determine the extent to which they agree 
with the statement. For example, a test for measuring happiness might ask respon- 
dents to indicate their level of agreement, from “strongly disagree” to “strongly 
agree,” with statements such as “I generally feel optimistic when I get up in the 
morning.” To produce agreement on what is meant by characteristics such as
“introversion,” psychologists have developed standardized tests that claim to measure
those attributes. 


CASE STUDY 3.2 Questions in Advertising
Source: Crossen (1994), pp. 74-75. 


Advertisements commonly present results without telling the listener or reader what 
choices were given to the respondents of a survey. Here are two examples: 


Levi Strauss released a marketing package presented as “Levi's 501 Report, a fall fashion 
survey conducted annually on 100 U.S. campuses.” As part of the report, it was noted 
that 90% of college students chose Levi's 501 jeans as being “in” on campus. What the 
resulting advertising failed to reveal was the list of choices, which noticeably omits blue 
jeans except for Levi's 501 jeans: 


Levi's 501 jeans                    T-shirts with graphics
1960s-inspired clothing             Lycra/spandex clothing
Overalls                            Patriotic-themed clothing
Decorated denim                     Printed, pull-on beach pants
Long-sleeved, hooded T-shirts       Neon-colored clothing


An advertisement for Triumph cigarettes boasted: “TRIUMPH BEATS MERIT—an
amazing 60% said Triumph tastes as good or better than Merit.” In truth, three choices 
were offered to respondents, including “no preference.” The results were: 36% 
preferred Triumph, 40% preferred Merit, and 24% said the brands were equal. So, 
although the wording of the advertisement is not false, it is also true that 64% said 
Merit tastes as good as or better than Triumph. Which brand do you think wins? 
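
The arithmetic behind the two competing claims is easy to check. This short sketch uses only the three percentages reported above; the variable names are ours.

```python
# Survey results reported in the case study (percent of respondents).
preferred_triumph = 36
preferred_merit = 40
said_equal = 24

# The three choices should account for everyone.
assert preferred_triumph + preferred_merit + said_equal == 100

# "As good or better" lumps the "equal" answers in with each brand's supporters.
triumph_as_good_or_better = preferred_triumph + said_equal
merit_as_good_or_better = preferred_merit + said_equal

print(f"Triumph as good or better: {triumph_as_good_or_better}%")
print(f"Merit as good or better: {merit_as_good_or_better}%")
```

The same 24% who saw no difference are counted in both totals, which is how each brand can claim a majority.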


3.5 Defining a Common Language 


So that we’re all speaking a common language for the rest of this book, we need to 
define some terms. We can perform different manipulations on different types of 
data, so we need a common understanding of what those types are. Other terms de- 
fined in this section are those that are well known in everyday usage but that have a 
slightly different technical meaning. 


Categorical versus Measurement Variables 


Thus far in this book, we have seen examples of measuring opinions (such as what 
you think is the most important problem facing society), numerical information 
(such as weight gain in infants), and attributes that can be transformed into numeri- 
cal information (such as IQ). To understand what we can do with these measure- 
ments, we need definitions to distinguish numerical measures from qualitative ones. 
Although statisticians make numerous fine distinctions among types of measure- 
ments, for our purposes it will be sufficient to distinguish between just two main 
types: categorical variables and measurement variables. Subcategories of these types 
will be defined for those who want more detail. 


Categorical Variables 

Categorical variables are those we can place into a category but that may not have 
any logical ordering. For example, you could be categorized as male or female. You 
could also be categorized based on what you name as the most important problem 
facing society. Notice that we are limited in how we can manipulate this kind of in- 
formation numerically. For example, we cannot talk about the average problem fac- 
ing society in the same way as we can talk about the average weight gain of infants 
during the first few days of life. 

If the possible categories have a natural ordering, the term ordinal variable is 
sometimes used. For instance, in a public opinion poll respondents may be asked to 
give an opinion chosen from “strongly agree, agree, neutral, disagree, strongly dis- 
agree.” Level of education attained may be categorized as “less than high school, 
high school graduate, college graduate, postgraduate degree.” To distinguish them 
from ordinal variables, categorical variables for which the categories do not have a 
natural ordering are sometimes called nominal variables. 


CHAPTER 3 Measurements, Mistakes, and Misunderstandings 47 


Measurement Variables 

Measurement variables, also called quantitative variables, are those for which we 
can record a numerical value and then order respondents according to those values. 
For example, IQ is a measurement variable because it can be expressed as a single 
number. An IQ of 130 is higher than an IQ of 100. Age, height, and number of cig- 
arettes smoked per day are other examples of measurement variables. Notice that 
these can be worked with numerically. Of course, not all numerical summaries will 
make sense even with measurement variables. For example, if one person in your 
family smokes 20 cigarettes a day and the remaining three members smoke none, it 
is accurate but misleading to say that the average number of cigarettes smoked by 
your family per day is 5 per person. We will learn about reasonable numerical sum- 
maries in Chapter 7. 
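
A quick check of the family example, using Python's standard `statistics` module. The median is shown only as one contrasting summary; it is our addition and not part of the example above.

```python
from statistics import mean, median

# Cigarettes smoked per day by each of the four family members in the example.
cigarettes_per_day = [20, 0, 0, 0]

average = mean(cigarettes_per_day)    # total of 20 spread over 4 people
middle = median(cigarettes_per_day)   # the typical family member smokes none

print(f"Mean: {average} per person")
print(f"Median: {middle} per person")
```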

Occasionally a further distinction is made for measurement variables based on 
whether ratios make sense. An interval variable is a measurement variable in which 
it makes sense to talk about differences, but not about ratios. Temperature is a good 
example of an interval variable. If it was 20 degrees last night and it’s 40 degrees to- 
day, we wouldn’t say it is twice as warm today as it was last night. But it would be 
reasonable to say that it is 20 degrees warmer, and it would mean the same thing as 
saying that when it’s 60 degrees it’s 20 degrees warmer than when it’s 40 degrees. A 
ratio variable has a meaningful value of zero, and it makes sense to talk about the 
ratio of one value to another. Pulse rate is a good example. For instance, if your pulse 
rate is 60 before you exercise and 120 after you exercise, it makes sense to say that 
your pulse rate doubled during exercise. 
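
One way to see the distinction is to change units and check what survives. In the sketch below we assume the temperatures are in degrees Fahrenheit (the text does not say) and convert them to Celsius; for pulse, we convert beats per minute to beats per second.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9


# Interval variable: temperature (Fahrenheit is an assumption here).
ratio_f = 40 / 20
ratio_c = fahrenheit_to_celsius(40) / fahrenheit_to_celsius(20)
print(f"Ratio in Fahrenheit: {ratio_f:.2f}, ratio in Celsius: {ratio_c:.2f}")

# Differences behave consistently: 20 to 40 is the same step as 40 to 60.
step_low = fahrenheit_to_celsius(40) - fahrenheit_to_celsius(20)
step_high = fahrenheit_to_celsius(60) - fahrenheit_to_celsius(40)
print(f"Steps in Celsius: {step_low:.2f} and {step_high:.2f}")

# Ratio variable: pulse rate has a meaningful zero, so the ratio survives a unit change.
ratio_per_minute = 120 / 60
ratio_per_second = (120 / 60) / (60 / 60)
print(f"Pulse ratio per minute: {ratio_per_minute}, per second: {ratio_per_second}")
```

The temperature ratio changes, and even changes sign, when the scale changes, so "twice as warm" has no fixed meaning. The pulse ratio is 2 in either unit.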


Continuous versus Discrete Measurement Variables 


Even when we can measure something with a number, we may need to distinguish 
further whether it can fall on a continuum. A discrete variable is one for which you 
could actually count the possible responses. For example, if we measure the number 
of automobile accidents on a certain stretch of highway, the answer could be 0, 1, 2, 
3, and so on. It could not be 2 1/2 or 3.8. Conversely, a continuous variable can be 
anything within a given interval. Age, for example, falls on a continuum. 

Something of a gray area exists between these definitions. For example, if we 
measure age to the nearest year, it may seem as though it should be called a discrete 
variable. But the real difference is conceptual. With a discrete variable you can count 
the possible responses without having to round off. With a continuous variable you 
can’t. In case you are confused by this, note that long ago you probably figured 
out the difference between the phrases “the number of” and “the amount of.” You 
wouldn’t say, “the amount of cigarettes smoked,” nor would you say, “the number of 
water consumed.” Discrete variables are analogous to numbers of things, and con- 
tinuous variables are analogous to amounts. You still need to be careful about word- 
ing, however, because we have a tendency to express continuous variables in discrete 
units. Although you wouldn’t say, “the number of water consumed,” you might say, 
“the number of glasses of water consumed.” That’s why it’s the concept of number 
versus amount that you need to think about.


Validity, Reliability, and Bias


The words we define in this section are commonly used in the English language, but 
they also have specific definitions when applied to measurements. Although these 
definitions are close to the general usage of the words, to avoid confusion we will 
spell them out. 


Validity 

When you talk about something being valid, you generally mean that it makes sense to 
you; it is sound and defensible. The same can be said for a measurement. A valid mea- 
surement is one that actually measures what it claims to measure. Thus, if you tried to 
measure happiness with an IQ test, you would not get a valid measure of happiness. 

A more realistic example would be trying to determine the selling price of a 
home. Getting a valid measurement of the actual sales price of a home is tricky be- 
cause the purchase often involves bargaining on what items are to be left behind by 
the old owners, what repairs will be made before the house is sold, and so on. These 
items can change the recorded sales price by thousands of dollars. If we were to de- 
fine the “selling price” as the price recorded in public records, it may not actually re- 
flect the price the buyer and seller had agreed was the true worth of the home. 

To determine whether a measurement is valid, you need to know exactly what 
was measured. For example, many readers, once they are informed of the definition, 
do not think the unemployment figures provided by the U.S. government are a valid 
measure of unemployment, as the term is generally understood. Remember (from 
Example 4) that the figures do not include “discouraged workers.” However, the gov- 
ernment statistics are a valid measure of the percentage of the “labor force” that is 
currently “unemployed,” according to the precise definitions supplied by the Bureau 
of Labor Statistics. The problem is that most people do not understand exactly what 
the government has measured. 


Reliability

When we say something or someone is reliable, we mean that that thing or person can
be depended upon time after time. A reliable car is one that will start every time and 
get us where we are going without worry. A reliable friend is one who is always there 
for us, not one who is sometimes too busy to bother with us. Similarly, a reliable mea- 
surement is one that will give you or anyone else approximately the same result time 
after time when taken on the same object or individual. For example, a reliable way to 
define the selling price of a home would be the officially recorded amount. This may 
not be valid, but it would give us a consistent figure without any ambiguity.

Reliability is a useful concept in psychological and aptitude testing. An IQ test
is obviously not much use if it measures the same person’s IQ to be 80 one time and 
130 the next. Whether we agree that the test is measuring what we really mean by 
“intelligence” (that is, whether it is really valid), it should at least be reliable enough 
to give us approximately the same number each time. Commonly used IQ tests are 
fairly reliable: About two-thirds of the time, taking the test a second time gives a 
reading within 2 or 3 points of the first test, and, most of the time, it gives a reading 
within about 5 points.

The most reliable measurements are physical ones taken with a precise measuring
instrument. For example, it is much easier to get a reliable measurement of
height than of happiness, assuming you have an accurate tape measure. However, 
you should be cautious of measurements given with greater precision than you think 
the measuring tool would be capable of providing. The degree of precision probably 
exceeds the reliability of the measurement. For example, if your friend measures the 
width of a swimming pool with a ruler and reports that it is 15.771 feet wide, which 
is 15’ 9 1/4", you should be suspicious. It would be very difficult to measure a dis- 
tance that large reliably with a 12-inch ruler. A second measuring attempt would un- 
doubtedly give a different number. 
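
To see how much precision the reported figure implies, the sketch below converts 15.771 feet to feet and inches and works out how small a length the last decimal place represents.

```python
reported_feet = 15.771

whole_feet = int(reported_feet)
inches = (reported_feet - whole_feet) * 12   # leftover fraction of a foot, in inches

# The third decimal place of a foot corresponds to this many inches.
last_digit_in_inches = 0.001 * 12

print(f"{whole_feet} feet {inches:.2f} inches")
print(f"Last reported digit is worth {last_digit_in_inches:.3f} inch")
```

The leftover is about 9 1/4 inches, matching the text, and the final digit claims accuracy to roughly a hundredth of an inch, far finer than repeated placements of a 12-inch ruler could deliver.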


Bias 

A systematic prejudice in one direction is called a bias. Similarly, a measurement 
that is systematically off the mark in the same direction is called a biased measure- 
ment. If you were trying to weigh yourself with a scale that was not satisfactorily 
adjusted at the factory and was always a few pounds under, you would get a biased 
view of your own weight. When we used the term earlier in discussing the wording 
of questions, we noted that either intentional or unintentional bias could enter into 
the responses of a poorly worded survey question. Notice that a biased measurement 
differs from an unreliable measurement because it is consistently off the mark in the 
same direction. 
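
The contrast between a biased measurement and an unreliable one can be sketched with two made-up sets of scale readings. All numbers here are hypothetical, including the true weight.

```python
from statistics import mean, pstdev

true_weight = 150.0  # assumed true value, in pounds

# Hypothetical readings from two scales, five weighings each.
biased_scale = [147.0, 147.1, 146.9, 147.0, 147.0]       # consistent, but about 3 pounds under
unreliable_scale = [144.0, 156.0, 149.0, 153.0, 148.0]   # centered on the truth, but scattered

for name, readings in (("biased", biased_scale), ("unreliable", unreliable_scale)):
    average_error = mean(readings) - true_weight   # systematic error: bias
    spread = pstdev(readings)                      # inconsistency: lack of reliability
    print(f"{name}: average error {average_error:+.1f}, spread {spread:.2f}")
```

The first scale is reliable but biased: its readings agree with one another and are all low. The second is unbiased but not reliable: its average is right, yet any single reading may be several pounds off.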


Variability across Measurements 


If someone has variable moods, we mean that that person has unpredictable 
swings in mood. When we say the weather is quite variable, we mean it changes 
without any consistent pattern. Most measurements are prone to some degree of 
variability. By that, we mean that they are likely to differ from one time to the 
next or from one individual to the next because of unpredictable errors or discrep- 
ancies that are not readily explained. If you tried to measure your height as in- 
structed at the beginning of this chapter, you probably found some unexplainable 
variability from one time to the next. If you tried to measure the length of a table 
by laying a ruler end to end, you would undoubtedly get a slightly different answer 
each time. 

Unlike the other terms we have defined, which are used to characterize a single 
measurement, variability is a concept used when we talk about two or more mea- 
surements in relation to each other. Sometimes two measurements vary because the 
measuring device produces unreliable results—for example, when we try to measure 
a large distance with a small ruler. The amount by which each measurement differs 
from the true value is called measurement error. 

Variability can also result from changes across time in the system being mea- 
sured. For example, even with a very precise measuring device your recorded blood 
pressure will differ from one moment to the next. Unemployment rates vary from 
one month to the next because people move in and out of jobs and the workforce. 
These differences represent natural variability across time in the individual or 
system being measured.

Natural variability also explains why many measurements differ across individuals.
Even if we could measure everyone’s height precisely, we wouldn’t get the
same value for everyone because people naturally come in different heights. If we 
measured unemployment rates in different states of the United States at the same 
time, they would vary because of natural variability in conditions and individuals 
across states. If we measure the annual rainfall in one location for each of many 
years, it will vary because weather conditions naturally differ from one year to the 
next. 


The Importance of Natural Variability 


Understanding the concept of natural variability is crucial to understanding modern 
statistical methods. When we measure the same quantity across several individuals, 
such as the weight gain of newborn babies, we are bound to get some variability. Al- 
though some of this may be due to our measuring instrument, most of it is simply 
due to the fact that everyone is different. Variability is plainly inherent in nature. Ba- 
bies all gain weight at their own pace. If we want to compare the weight gain of a 
group of babies who have consistently listened to a heartbeat to the weight gain of a 
group of babies who have not, we first need to know how much variability to expect 
due to natural causes. 

We encountered the idea of natural variability when we discussed comparing 
resting pulse rates of men and women in Chapter 1. If there were no variability 
within each sex, it would be easy to detect a difference between males and females. 
The more variability there is within each group, the more difficult it is to detect a dif- 
ference between groups. Natural variability can occur when taking repeated mea- 
surements on the same individual as well. Even if it could be measured precisely, 
your pulse rate is not likely to remain constant throughout the day. Some measure- 
ments are more likely to exhibit this variability than others. For example, height (if 
it could be measured precisely) and your opinion on issues like gun control and abor- 
tion are likely to remain constant over short time periods. 
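
A rough sketch of why variability within groups matters. The pulse rates below are invented for illustration; both scenarios have the same group averages, and only the spread within each group differs. Checking whether the groups' ranges overlap is a crude device of ours, not a method from this book.

```python
from statistics import mean


def ranges_overlap(group_a, group_b):
    """True if the lowest-to-highest spans of the two groups overlap."""
    return max(group_a) >= min(group_b) and max(group_b) >= min(group_a)


# Hypothetical resting pulse rates with little variability within each group.
men_low_var = [64, 65, 66, 65, 65]
women_low_var = [74, 75, 76, 75, 75]

# Hypothetical rates with the same averages but much more variability.
men_high_var = [45, 85, 60, 75, 60]
women_high_var = [55, 95, 70, 85, 70]

for label, men, women in (
    ("little variability", men_low_var, women_low_var),
    ("a lot of variability", men_high_var, women_high_var),
):
    gap = mean(women) - mean(men)
    print(f"{label}: gap in averages {gap}, ranges overlap: {ranges_overlap(men, women)}")
```

In both scenarios the averages differ by 10 beats, but only in the first is the difference obvious from a glance at the individual values.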

To summarize, variability across measurements can occur for at least three rea- 
sons. First, measurements are imprecise, and thus measurement error is a source of 
variability. Second, there is natural variability across individuals at any given time. 
And third, often there is natural variability in a characteristic of the same individual 
across time. 

In Part 4, we will learn how to sort out differences due to natural variability from 
differences due to features we can define, measure, and possibly manipulate, such as 
variability in blood pressure due to amount of salt consumed, or variability in weight 
loss due to time spent exercising. In this way, we can study the effects of diet or 
lifestyle choices on disease, of advertising campaigns on consumer choices, of exer- 
cise on weight loss, and so on. 

This one basic idea, comparing natural variability to the variability induced by 
different behaviors, interventions, or group memberships, forms the heart of modern 
statistics. It has allowed Salk to conclude that heartbeats are soothing to infants and 
the medical community to conclude that aspirin helps prevent heart attacks. We will 
see numerous other conclusions based on this idea throughout this book.


Exercises


Asterisked (*) exercises are included in the Solutions at the back of the book. 


*1. Give an example of a measure that is 
*a. Valid and categorical
*b. Reliable but biased 
*c. Unbiased but not reliable 
2. Give an example of a survey question that is 
a. Deliberately biased 
b. Unintentionally biased 
c. Unnecessarily complex 
d. Likely to cause respondents to lie 
3. Give an example of a survey question that is 
a. Most appropriately asked as an open question 
b. Most appropriately asked as a closed question 


*4. Explain which (one or more) of the seven pitfalls listed in Section 3.2 applies to 
each of the following potential survey questions: 


*a. Do you support banning prayers in schools so that teachers have more time 
to spend teaching? 


b. Do you agree that marijuana should be legal? 


c. Studies have shown that consuming one alcoholic drink daily helps reduce 
heart disease. How many alcoholic drinks do you consume daily? 


*5. Refer to Question 4. Reword each question so it avoids the seven pitfalls. 


*6. Specify whether each of the following is a categorical or measurement variable. 
If you think the variable is ambiguous, discuss why. 


*a. Years of formal education 


*b. Highest level of education completed (grade school, high school, college, 
higher than college) 


c. Brand of car owned 
d. Price paid for the last car purchased 
e. Type of car owned (subcompact, compact, mid-size, full-size, sports, pickup) 


7. Refer to the previous exercise. In each case, if the variable is categorical, spec- 
ify whether it is ordinal or nominal. If it is a measurement variable, specify 
whether it is an interval or a ratio variable. Explain your answers. 


*8. Specify whether each of the following measurements is discrete or continuous. 
If you think the measurement is ambiguous, discuss why. 


*a. The number of floors in a building
*b. The height of a building measured as precisely as possible 


c. The number of words in a book 
d. The weight of a book
e. A person’s IQ


9. Refer to the previous exercise. In each case, explain whether the measurement
is an interval or a ratio variable.


*10. Explain whether a variable can be both
a. Nominal and categorical
b. Nominal and ordinal
c. Interval and categorical
d. Discrete and interval


11. If we were interested in knowing whether the average price of homes in a certain
county had gone up or down this year in comparison with last year, would
we be more interested in having a valid measure or a reliable measure of sales
price? Explain.


*12. In Chapter 1, we discussed Lee Salk’s experiment in which he exposed one
group of infants to the sound of a heartbeat and compared their weight gain to
that of a group not exposed. Do you think it would be easier to discover a
difference in weight gain between the group exposed to the heartbeat and the
“control group” if there were a lot of natural variability among babies, or if there
were only a little? Explain.


13. Do you think the crime statistics reported by the police are a valid measure of
the amount of crime in a given city? Are they a reliable measure? Discuss.


14. Refer to Case Study 2.2, “Brooks Shoes Brings Flawed Study to Court.” Discuss
the study conducted by Brooks Shoe Manufacturing Company in the context of
the seven pitfalls, listed in Section 3.2, that can be encountered when asking
questions in a survey.


15. An advertiser of a certain brand of aspirin (let’s call it Brand B) claims that it is
the preferred painkiller for headaches, based on the results of a survey of
headache sufferers. The choices given to respondents were: Tylenol,
Extra-Strength Tylenol, Brand B aspirin, Advil.
a. Is this an open- or closed-form question? Explain.
b. Comment on the variety of choices given to respondents.
c. Comment on the advertiser’s claim.


*16. Schuman and Presser (1981, p. 277) report a study in which one set of
respondents was asked question A, and the other set was asked question B:


A. Do you think the United States should forbid public speeches against
democracy?


B. Do you think the United States should allow public speeches against
democracy?


For one version of the question, only about one-fifth of the respondents were
against such freedom of speech, whereas for the other version almost half were
against such freedom of speech. Which question do you think elicited which
response? Explain.


17. Give an example of two questions in which the order in which they are presented
would determine whether the responses were likely to be biased.


18. In February 1998, U.S. President Bill Clinton was under investigation for
allegedly having had an extramarital affair. A Gallup Poll asked the following two
questions: “Do you think most presidents have or have not had extramarital
affairs while they were president?” and then “Would you describe Bill Clinton’s
faults as worse than most other presidents, or as no worse than most other
presidents?” For the first question, 59% said “have had,” 33% said “have not,” and
the remaining 8% had no opinion. For the second question, 24% said “worse,”
75% said “no worse,” and only 1% had no opinion. Do you think the order of
these two questions influenced the results? Explain.


*19. Sometimes medical tests, such as those for detecting HIV, are so sensitive that
people do not want to give their names when they take the test. Instead, they are
given a number or code, which they use to obtain their results later. Is this
procedure anonymous testing or is it confidential testing? Explain.


20. Give three versions of a question to determine whether people think smoking
should be banned on all airline flights. Word the question to be as follows:
a. As unbiased as possible
b. Likely to get people to respond that smoking should be forbidden
c. Likely to get people to respond that smoking should not be forbidden


21. Explain the difference between a discrete variable and a categorical variable.
Give an example of each type.


*22. Suppose you were to compare two routes to school or work by timing yourself
on each route for five days. Suppose the times on one route were (in minutes)
10, 12, 13, 15, 20, and on the other route they were 10, 15, 16, 18, 21.


*a. The average times for the two routes are 14 minutes and 16 minutes. Would
you be willing to conclude that the first route is faster, on average, based on
these sample measurements?


*b. Give an example of two sets of times, where the first has an average of 14
minutes and the second an average of 16 minutes, for which you would be
willing to conclude that the first route is faster.


c. Explain how the concept of natural variability entered into your conclusions
in parts a and b.


*23. Give an example of a characteristic that could be measured as either a discrete
or a continuous variable, depending on the types of units used.


24. Airlines compute the percentage of flights that are on time to be the percentage
that arrive no later than 15 minutes after their scheduled arrival time. Is this a
valid measure of on-time performance? Is it a reliable measure? Explain.


*25. If each of the following measurements were to be taken on a group of 50
college students (once only for each student), it is unlikely that all 50 of them
would yield the same value. In other words, there would be variability in the
measurements. In each case, explain whether their values are likely to differ
because of natural variability across time, natural variability across individuals,
measurement error, or some combination of these three causes.


*a. Systolic blood pressure 
b. Blood type (A, B, O, AB) 

*c. Time on the student’s watch when the actual time is 12 noon 
d. Actual time when the student’s watch says it’s 12 noon 


Explain whether there is likely to be variability in the following measurements 
if they were to be taken on 10 consecutive days for the same student. If so, ex- 
plain whether the variability most likely would be due to natural variability 
across time, natural variability across individuals, measurement error, or some 
combination of these three causes. 

*a. Systolic blood pressure 

*b. Blood type (A, B, O, AB) 
c. Time on the student’s watch when the actual time is 12 noon 


d. Actual time when the student’s watch says it’s 12 noon 


Read Original Source 4 on the CD, “Duke Health Briefs: Positive Outlook 
Linked to Longer Life in Heart Patients.” Explain how the researchers measured 
“happiness.” 


Locate Original Source 11 on the CD, “Driving impairment due to sleepiness is 
exacerbated by low alcohol intake.” Find the description of how the researchers 
measured “subjective sleepiness.” 


a. Explain how “subjective sleepiness” was measured. 

b. Was “subjective sleepiness” measured as a nominal, ordinal, or measurement 
variable? Explain. 

Explain how “depression” was measured for the research discussed in News 

Story 19 in the Appendix, “Young romance may lead to depression, study says.” 


Refer to the detailed report labeled as Original Source 13: “2003 CASA 
National Survey of American Attitudes on Substance Abuse VIII: Teens and 
Parents” on the CD. 


a. Locate the questions asked of the teens, in Appendix D. Two questions asked 
as “open questions” were Question 1 and Question 11. Explain which of the 
two questions was less likely to cause problems in categorizing the answers. 

b. The most common response to Question 11 was “Sports team.” Read the 
question and explain why this might have been the case. 

c. Two versions of Question 28 were asked, one using the word sold and one 
using the word used. Did the wording of the question appear to affect the re- 
sponses? Explain. 


Refer to the detailed report labeled as Original Source 13: “2003 CASA 
National Survey of American Attitudes on Substance Abuse VIII: Teens and 
Parents” on the CD. Locate the questions asked of the parents, in Appendix E. For 
each of the following questions explain whether the response was a nominal 
variable, an ordinal variable, or a measurement variable. 

a. Question 2 


b. Question 9 
c. Question 12 
d. Question 19 
e. Question 29 
f. Question 30 


Exercises 32 to 35 refer to News Story 2, “Research shows women harder hit by 
hangovers” and Original Source 2, “Development and initial validation of the 
Hangover Symptoms Scale: Prevalence and correlates of hangover symptoms in 
college students” on the CD accompanying the book. 


32. The researchers were interested in measuring the severity of hangovers for each 
person so they developed a “Hangover Symptoms Scale.” Read the article and 
explain what they measured with this scale. 


*33. Explain whether the Hangover Symptoms Scale for each individual in this study 
is likely to be 

*a. A valid measure of hangover severity 
b. A reliable measure of hangover severity 


34. To make the conclusion that women are harder hit by hangovers, the researchers 
measured two variables on each individual. Specify the two variables and 
explain whether each one is a categorical or a measurement variable. 


35. The measurements in this study were self-reported by the participants. Explain 
the extent to which you think this may systematically have caused the 
measurements of hangover severity of men or women or both to be biased, and whether 
that may have affected the conclusions of the study in any way. 


Mini-Projects 


1. Measure the heights of five males and five females. Draw a line to scale, starting 
at the lowest height in your group and ending at the highest height, and mark each 
male with an M and each female with an F. It should look something like this: 


F FM FMF F MM M 
5' 6'2" 


Explain exactly how you measured the heights, and then answer each of the 
following: 

a. Are your measures valid? 

b. Are your measures reliable? 

c. How does the variability in the measurements within each group compare to 
the difference between the two groups? For example, are all of your men 
taller than all of your women? Are they completely intermixed? 

d. Do you think your measurements would convince an alien being that men are 
taller, on average, than women? Explain. Use your answer to part c as part 
of your explanation. 
2. Design a survey with three questions to measure attitudes toward something of 
interest to you. Now design a new version by changing just a few words in each 
question to make it deliberately biased. Choose 20 people to whom you will ad- 
minister the survey. Put their names in a hat (or a box or a bag) and draw out 10 
names. Administer the first (unbiased) version of the survey to this group and 
the second (biased) version to the remaining 10 people. Compare the responses 
and discuss what happened. 


3. Find a study that includes an emotion like “depression” or “happiness” as one 
of the measured variables. Explain how the researchers measured that emotion. 
Discuss whether the method of measurement is likely to produce valid mea- 
surements. Discuss whether the method of measurement is likely to produce re- 
liable measurements. 


References 


Anastasi, Anne, and Susana Urbina. (1997). Psychological testing. 7th ed. New York: 
Macmillan. 


Crossen, Cynthia. (1994). Tainted truth. New York: Simon and Schuster. 


Loftus, E. F., and J. C. Palmer. (1974). Reconstruction of automobile destruction: An exam- 
ple of the interaction between language and memory. Journal of Verbal Learning and Ver- 
bal Behavior 13, pp. 585-589. 


Morin, Richard. (10-16 April 1995). What informed public opinion? Washington Post, Na- 
tional Weekly Edition. 

Plous, Scott. (1993). The psychology of judgment and decision making. New York: McGraw- 
Hill. 

Schuman, H., and S. Presser. (1981). Questions and answers in attitude surveys. New York: 
Academic Press. 


Schuman, H., and J. Scott. (22 May 1987). Problems in the use of survey questions to mea- 
sure public opinion. Science 236, pp. 957-959. 


How to Get a Good Sample 


Thought Questions 


1. What do you think is the major difference between a survey (such as a public 
opinion poll) and an experiment (such as the heartbeat experiment in Case Study 1.1)? 


2. Suppose a properly chosen sample of 1600 people across the United States was 
asked if they regularly watch a certain television program, and 24% said yes. How 
close do you think that is to the percentage of the entire country who watch the 
show? Within 30%? 10%? 5%? 1%? Exactly the same? 


3. Many television stations conduct polls by asking viewers to call one phone number if 
they feel one way about an issue and a different phone number if they feel the 
opposite. Do you think the results of such a poll represent the feelings of the 
community? Do you think they represent the feelings of all those watching the TV station at 
the time or the feelings of some other group? Explain. 


4. Suppose you had a telephone directory listing all the businesses in a city, 
alphabetized by type of business. If you wanted to phone 100 of them to get a 
representative sampling of opinion on some issue, how would you select which 100 to phone? 
Why would it not be a good idea to simply use the first 100 businesses listed? 


5. There are many professional polling organizations, such as Gallup and Roper. They 
often report on surveys they have done, announcing that they have sampled 1243 
adults, or some such number. How do you think they select the people to include in 
their samples? 
4.1 Common Research Strategies 


In Chapters 1 to 3, we discussed scientific studies in general, without differentiating 
them by type. In this chapter and the next, we are going to look at proper ways to 
conduct specific types of studies. When you read the results of a scientific study, the 
first thing you need to do is determine which research strategy was used. You can 
then see whether or not the study used the proper methods for that strategy. In this 
chapter and the next, you will learn about potential difficulties and outright disasters 
that can befall each type of study, as well as some principles for executing them cor- 
rectly. First, let’s examine the common types of research strategies. 


Sample Surveys 


You are probably quite familiar with sample surveys, at least in the form of political 
and opinion polls. In a sample survey, a subgroup of a large population is ques- 
tioned on a set of topics. The results from the subgroup are used as if they were rep- 
resentative of the larger population, which they will be if the sample was chosen 
correctly. There is no intervention or manipulation of the respondents in this type of 
research; they are simply asked to answer some questions. We examine sample 
surveys in more depth later in this chapter. 


Randomized Experiments 


An experiment measures the effect of manipulating the environment in some way. 
For example, the manipulation may include receiving a drug or medical treatment, 
going through a training program, following a special diet, and so on. In a random- 
ized experiment, the manipulation is assigned to participants on a random basis. In 
Chapter 5 we will learn more about how this is done. 

Most experiments on humans use volunteers because you can’t force someone to 
accept a manipulation. You then measure the effect of the feature being manipulated, 
called the explanatory variable, on an outcome, called the outcome variable or 
response variable. Examples of outcome variables are cholesterol level (after taking 
a new drug), amount learned (after a new training program), or weight loss (after a 
special diet). 

As an example, recall Case Study 1.2, a randomized experiment that investigated 
the relationship between aspirin and heart attacks. The explanatory variable, manip- 
ulated by the researchers, was whether a participant took aspirin or a placebo. The 
variable was then used to help explain the outcome variable, which was whether a 
participant had a heart attack or not. Notice that the explanatory and outcome vari- 
ables are both categorical in this case, with two categories each (aspirin/placebo and 
heart attack/no heart attack). 

Randomized experiments are important because, unlike most other studies, they 
often allow us to determine cause and effect. The participants in an experiment are 
usually randomly assigned to either receive the manipulation or take part in a 
control group. The purpose of the random assignment is to make the two groups 
approximately equal in all respects except for the explanatory variable, which is 
purposely manipulated. Differences in the outcome variable between the groups, if large 
enough to rule out natural chance variability, can then be attributed to the manipula- 
tion of the explanatory variable. 

For example, suppose we flip a coin to assign each of a number of new babies 
into one of two groups. Without any intervention, we should expect both groups to 
gain about the same amount of weight, on average. If we then expose one group to 
the sound of a heartbeat and that group gains significantly more weight than the 
other group, we can be reasonably certain that the weight gain was due to the 
sound of the heartbeat. Similar reasoning applies when more than two groups are 
used. 
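
The coin-flip assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the birth weights are made-up numbers drawn from a bell-shaped curve, and the function name `randomly_assign` is hypothetical. The point is that, before any treatment is applied, the two randomly formed groups have nearly the same average.

```python
import random
from statistics import mean

def randomly_assign(units, rng):
    """Flip a fair coin for each unit: heads -> treatment, tails -> control."""
    treatment, control = [], []
    for unit in units:
        if rng.random() < 0.5:
            treatment.append(unit)
        else:
            control.append(unit)
    return treatment, control

rng = random.Random(42)  # fixed seed so the run is repeatable
# Hypothetical birth weights in grams for 1000 babies (made-up data).
birth_weights = [rng.gauss(3400, 450) for _ in range(1000)]
treatment, control = randomly_assign(birth_weights, rng)

print("group sizes:", len(treatment), len(control))
print("average starting weight:", round(mean(treatment)), round(mean(control)))
```

Because nothing but chance decides who lands in which group, any sizable difference that shows up after the treatment is hard to explain by the makeup of the groups.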


Observational Studies 


As we noted in Chapter 1, an observational study resembles an experiment except 
that the manipulation occurs naturally rather than being imposed by the experi- 
menter. For example, we can observe what happens to people’s weight when they 
quit smoking, but we can’t experimentally manipulate them to quit smoking. We 
must rely on naturally occurring events. This reliance on naturally occurring events 
leads to problems with establishing a causal connection because we can’t arrange to 
have a similar control group. For instance, people who quit smoking may do so be- 
cause they are on a “health kick” that also includes better eating habits, a change in 
coffee consumption, and so on. In this case, if we were to observe a weight loss (or 
gain) after cessation of smoking, we would not know if it were caused by the 
changes in diet or the lack of cigarettes. In an observational study, you cannot as- 
sume that the explanatory variable of interest to the researchers is the only one that 
may be responsible for any observed differences in the outcome variable. 

A special type of observational study is frequently used in medical research. 
Called a case-control study, it is an attempt to include an appropriate control group. 
In Chapter 5, we will explore the details of how these and other observational stud- 
ies are conducted, and in Chapter 6, we will cover some examples in depth. 

Observational studies do have one advantage over experiments. Researchers are 
not required to induce artificial behavior. Participants are simply observed doing 
what they would do naturally; therefore, the results can be more readily extended to 
the real world. 


Meta-Analyses 


A meta-analysis is a quantitative review of a collection of studies all done on a sim- 
ilar topic. Combining information from various researchers may result in the emer- 
gence of patterns or effects that weren’t conclusively available from the individual 
studies. 

It is becoming quite common for the results of meta-analyses to appear in news- 
papers and magazines. For example, the top headline in the November 24, 1993, San 
Jose Mercury News was, “Why mammogram advice keeps changing: S.F. study con- 
tradicts cancer society’s finding.” The article explained that, in addition to the results 
of new research in San Francisco, “a recent analysis of eight international studies did 
not find any clear benefit to women getting routine mammograms while in their 
40s.” When you see wording indicating that many studies were analyzed together, 
the report is undoubtedly referring to a meta-analysis. In this case, the eight studies 
in question had been conducted over a 30-year period with 500,000 women. Unfor- 
tunately, that information was missing from the newspaper article. Missing informa- 
tion is one of the problems with trying to evaluate a news article based on 
meta-analysis. In Chapter 25, we will examine meta-analyses and the role they play 
in science. 


Case Studies 


A case study is an in-depth examination of one or a small number of individuals. 
The researcher observes and interviews that individual and others who know about 
the topic of interest. For example, to study a purported spiritual healer, a researcher 
might observe her at work, interview her about techniques, and interview clients who 
had been treated by the healer. We do not cover case studies of this type because they 
are descriptive and do not require statistical methods. We will issue one warning, 
though. Be careful not to assume you can extend the findings of a case study to any 
person or situation other than the one studied. In fact, case studies may be used to 
investigate situations precisely because they are rare and unrepresentative. 


EXAMPLE 1 Two Studies That Compared Diets 


There are many claims about the health benefits of various diets, but it is difficult to test 
them because there are so many related variables. For instance, people who eat a veg- 
etarian diet may be less likely to smoke than people who don’t. Most studies that at- 
tempt to test claims about diet are observational studies. 

For instance, News Story 20 in the Appendix, “Eating organic foods reduces pesti- 
cide concentration in children,” is based on an observational study in which parents kept 
a food diary for their children for three days. Concentrations of various pesticides were 
then measured in the children’s urine. The researchers compared the pesticide measure- 
ments for children who ate primarily organic produce and those who ate primarily con- 
ventional produce. They did find lower pesticide levels in the children who ate organic 
foods, but there is no way to know if the difference was the result of food choices, or 
if children who ate organic produce had less pesticide exposure in other ways. (The re- 
searchers did attempt to address this issue, as we will see when we revisit this example.) 
We will learn more about what can be concluded from this type of observational study 
in Chapter 5. 

In contrast, News Story 3, “Rigorous veggie diet found to slash cholesterol,” was 
based on a randomized experiment. The study used volunteers who were willing to be 
told what to eat during the month-long study. The volunteers were randomly assigned 
to one of three diet groups and reduction in cholesterol was measured and compared 
for the three groups at the end of the study. Because the participants were randomly as- 
signed to the three diet groups, other variables that may affect cholesterol, such as 
weight or smoking behavior, should have been similar across all three groups. 

This example illustrates that a choice can sometimes be made between conducting 
a randomized experiment and an observational study. An advantage of a randomized 
experiment is that factors other than the one being manipulated should be similar across 
the groups being compared. An advantage of an observational study is that people do 
what comes naturally to them. Knowing that a particular diet substantially reduces 
cholesterol doesn't help much if no one in the real world would follow the diet. We will 
revisit these ideas in much greater depth in Chapter 5. 


4.2 Defining a Common Language 


In the remainder of this chapter, we explore the methods used in sample surveys. To 
make our discussion of sampling methods clear, let’s establish a common language. 
As we have seen before, statisticians borrow words from everyday language and 
attach specialized meaning to them. 

The first thing you need to know is that researchers sometimes speak 
synonymously of the individuals being measured and the measurements themselves. You can 
usually figure this out from the context. The relevant definitions cover both meanings. 


A unit is a single individual or object to be measured. 

The population (or universe) is the entire collection of units about which 
we would like information or the entire collection of measurements we 
would have if we could measure the whole population. 

The sample is the collection of units we actually measure or the collection 
of measurements we actually obtain. 

The sampling frame is a list of units from which the sample is chosen. 
Ideally, it includes the whole population. 

In a sample survey, measurements are taken on a subset, or sample, of 
units from the population. 

A census is a survey in which the entire population is measured. 


EXAMPLE 2 Determining Monthly Unemployment in the United States 


In the United States, the Bureau of Labor Statistics (BLS) is responsible for determining 
monthly unemployment rates. To do this, the BLS does not collect information on all 
adults; that is, it does not take a census. Instead, employees visit approximately 60,000 
households, chosen from a list of all known households in the country, and obtain in- 
formation on the approximately 116,000 adults living in them. They classify each person 
as employed, unemployed, or “not in the labor force.” The last category includes the 
“discouraged workers” discussed in Chapter 3. The unemployment rate is the number 
of unemployed persons divided by the sum of the employed and unemployed. Those 
“not in the labor force” are not included at all. (See the U.S. Department of Labor's BLS 
Handbook of Methods, referenced at the end of the chapter, for further details, or visit 
their Web site, http://www.bls.gov.) 

Before reading any further, try to apply the definitions you have just learned to 
the way the BLS calculates unemployment. In other words, specify the units, the popu- 
lation, the sampling frame, and the sample. Be sure to include both forms of each def- 
inition when appropriate. The units of interest to the BLS are adults in the labor force, 
meaning adults who meet their definitions of employed and unemployed. Those who 
are “not in the labor force” are not relevant units. The population of units consists of all 
adults who are in the labor force. The population of measurements, if we could obtain 
it, would consist of the employment status (working or not working) of everyone in the 
labor force. The sampling frame is the list of all known households in the country. The 
people who actually get asked about their employment status by the BLS constitute 
the units in the sample, and their actual employment statuses constitute the 
measurements in the sample. 
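
The unemployment rate defined in this example can be written as a short Python function. The counts below are hypothetical and are not BLS figures; they were chosen only so that the three categories add up to about 116,000 adults. The function name `unemployment_rate` is likewise an assumption made for illustration.

```python
def unemployment_rate(employed, unemployed):
    """Unemployed divided by the labor force (employed + unemployed)."""
    labor_force = employed + unemployed
    if labor_force == 0:
        raise ValueError("labor force is empty")
    return unemployed / labor_force

# Hypothetical sample counts, not BLS figures.
employed, unemployed, not_in_labor_force = 70_000, 4_000, 42_000
rate = unemployment_rate(employed, unemployed)
print(f"unemployment rate: {rate:.1%}")
```

Notice that `not_in_labor_force` never enters the calculation, which mirrors the statement that those "not in the labor force" are not included at all.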


4.3 The Beauty of Sampling 


Here is some information that may astound you. If you use commonly accepted 
methods to sample 1500 adults from an entire population of millions of adults, you 
can almost certainly gauge, to within 3%, the percentage of the entire population 
who have a certain trait or opinion. (There is nothing magical about 1500 and 3%, 
as you will soon see.) Even more amazing is the fact that this result doesn’t depend 
on how big the population is; it depends only on how many are in the sample. Our 
sample of 1500 would do equally well at estimating, to within 3%, the percentage of 
a population of 10 billion. Of course, you have to use a proper sampling method— 
but we address that later. 
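
If this claim seems hard to believe, a small simulation can make it more plausible. The Python sketch below is an illustration under assumptions, not a proof: it supposes that exactly 40% of each population has the trait, draws 200 simple random samples of 1500 from populations of three very different sizes, and records how often the sample percentage lands within 3 percentage points of the truth. The function name `fraction_within` and all of the numbers are hypothetical.

```python
import random

def fraction_within(population_size, true_share, n, tolerance, trials, rng):
    """Share of simulated surveys whose sample proportion lands within tolerance."""
    # Units numbered 0 .. cutoff-1 have the trait; the rest do not.
    cutoff = int(population_size * true_share)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(range(population_size), n)  # simple random sample
        sample_share = sum(1 for unit in sample if unit < cutoff) / n
        if abs(sample_share - true_share) <= tolerance:
            hits += 1
    return hits / trials

rng = random.Random(2024)  # fixed seed so the run is repeatable
results = {}
for size in (100_000, 10_000_000, 10_000_000_000):
    results[size] = fraction_within(size, 0.40, 1500, 0.03, 200, rng)
    print(f"population {size:>14,}: {results[size]:.0%} of surveys within 3 points")
```

The three percentages should come out close to one another, which is the sense in which the accuracy depends on the sample size rather than the population size.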

You can see why researchers are content to rely on public opinion polls rather 
than trying to ask everyone for their opinion. It is much cheaper to ask 1500 people 
than several million, especially when you can get an answer that is almost as accu- 
rate. It also takes less time to conduct a sample survey than a census, and because 
fewer interviewers are needed, there is better quality control. 


Accuracy of a Sample Survey: Margin of Error 


Most sample surveys are used to estimate the proportion or percentage of people 
who have a certain trait or opinion. For example, the Nielsen ratings, used to deter- 
mine the percentage of American television sets tuned to a particular show, are based 
on a sample of a few thousand households. Newspapers and magazines routinely 
conduct surveys of a few thousand people to determine public opinion on current 
topics of interest. As we have said, these surveys, if properly conducted, are amaz- 
ingly accurate. The measure of accuracy is a number called the margin of error. The 
sample proportion differs from the population proportion by more than the margin 
of error less than 5% of the time, or in fewer than 1 in 20 surveys. 


As a general rule, the amount by which the proportion obtained from the 
sample will differ from the true population proportion rarely exceeds 1 divided by 
the square root of the number in the sample. This is expressed by the simple 
formula $1/\sqrt{n}$, where the letter n represents the number of people in the 
sample. To express results in terms of percentages instead of proportions, simply 
multiply everything by 100. 
For example, with a sample of 1600 people, we usually get an estimate that is 
accurate to within 1/40 = 0.025 = 2.5% of the truth because the square root of 1600 
is 40. You might see results such as “Fifty-five percent of respondents support the 
president’s economic plan. The margin of error for this survey is plus or minus 2.5 
percentage points.” This means that it is almost certain that between 52.5% and 
57.5% of the entire population support the plan. In other words, add and subtract the 
margin of error to the sample value, and the resulting interval almost surely covers 
the true population value. If you were to follow this method every time you read the 
results of a properly conducted survey, the interval would only miss covering the 
truth about 1 in 20 times. 
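
The rule of thumb and the add-and-subtract step can be sketched in Python as follows. The function names `margin_of_error` and `interval` are hypothetical, and clipping the interval to the range 0 to 1 is an added convenience that the text does not discuss.

```python
import math

def margin_of_error(n):
    """Rule-of-thumb margin of error for a sample proportion: 1 / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 1 / math.sqrt(n)

def interval(sample_proportion, n):
    """Sample proportion plus and minus the margin of error, clipped to [0, 1]."""
    m = margin_of_error(n)
    return max(0.0, sample_proportion - m), min(1.0, sample_proportion + m)

# The survey from the text: 55% support among 1600 respondents.
low, high = interval(0.55, 1600)
print(f"margin of error: {margin_of_error(1600):.3f}")
print(f"interval: {low:.3f} to {high:.3f}")
```

For the survey of 1600 people this reproduces the margin of 0.025 and the interval from 52.5% to 57.5% given above.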


EXAMPLE 3 Measuring Teen Drug Use 


In News Story 13 in the Appendix, “3 factors key for drug use in kids,” the margin of 
error is provided at the end of the article, as follows: 

QEV Analytics surveyed 1,987 children ages 12 to 17 and 504 parents. . . . The 
margin of error was plus or minus two percentage points for children and plus or 
minus four percentage points for parents. 


Notice that n = 1987 children were surveyed, so the margin of error is $1/\sqrt{1987} \approx 0.0224$, 
or about 2.2%. There were 504 parents interviewed, so the margin of error for 
their responses is about $1/\sqrt{504} \approx 0.0445$, or about 4.45%. These values were 
rounded off in the news story, to 2% and 4%, respectively. The more accurate values of 
2.2% and 4.4% are given on page 30 in Original Source 13, along with an explanation 
of what they mean. 

The margin of error can be applied to any percent reported in the study to find an 
estimate of the percent of the population that would respond the same way. For in- 
stance, the news story reported that 20% of the children in the study said they could 
buy marijuana in an hour or less. Applying the margin of error, we can be fairly confi- 
dent that somewhere between 18% and 22% of all teens in the population represented 
by those in this survey would respond that way if asked. Notice that the news story mis- 
interprets this information when stating that “more than 5 million children ages 12 to 
17, or 20 percent, said they could buy marijuana in an hour or less.” In fact, only a to- 
tal of 1987 children were even asked! The figure of 5 million is a result of multiplying 
the percent of the sample who responded affirmatively (20%) by the total number of 
children in the population of 12- to 17-year-olds in the United States, presumably about 
25 million at the time of the study. 


Other Advantages of Sample Surveys 


When a Census Isn’t Possible 

Suppose you needed a laboratory test to see if your blood had too high a concentra- 
tion of a certain substance. Would you prefer that the lab measure the entire popula- 
tion of your blood, or would you prefer to give a sample? Similarly, suppose a 
manufacturer of firecrackers wanted to know what percentage of its products were 
duds. It would not make much of a profit if it tested them all, but it could get a rea- 
sonable estimate of the desired percentage by testing a properly selected sample. As 
these examples illustrate, there are situations where measurements destroy the units 
being tested and thus a census is not feasible. 




Speed 

Another advantage of a sample survey over a census is the amount of time it takes 
to conduct a sample survey. For example, it takes several years to successfully plan 
and execute a census of the entire population of the United States. Getting monthly 
unemployment rates would be impossible with a census; the results would be quite 
out-of-date by the time they were released. It is much faster to collect a sample than 
a census if the population is large. 


Accuracy 

A final advantage of a sample survey is that you can devote your resources to get- 
ting the most accurate information possible from the sample you have selected. It is 
easier to train a small group of interviewers than a large one, and it is easier to track 
down a small group of nonrespondents than the larger one that would inevitably re- 
sult from trying to conduct a census. 


4.4 Simple Random Sampling 


The ability of a relatively small sample to accurately reflect the opinions of a huge 
population does not happen haphazardly. It works only if proper sampling methods 
are used. Everyone in the population must have a specified chance of making it into 
the sample. Methods with this characteristic are called probability sampling plans. 

The simplest way of accomplishing this goal is to use a simple random sample. 
With a simple random sample, every conceivable group of people of the required 
size has the same chance of being the selected sample. 

To actually produce a simple random sample, you need two things. First, you 
need a list of the units in the population. Second, you need a source of random 
numbers. Random numbers can be found in tables designed for that purpose, called 
“tables of random digits,” or they can be generated by computers and calculators. If 
the population isn’t too large, physical methods can be used, as illustrated in the next 
hypothetical example. 


EXAMPLE 4 How to Sample from Your Class 


Suppose you are taking a class with 200 students and are unhappy with the teaching 
method. To substantiate that a problem exists so that you can complain to higher pow- 
ers, you decide to collect a simple random sample of 25 students and ask them for their 
opinions. 

Notice that a sample of this size would have a margin of error of about 20% 
because $1/\sqrt{25} = 1/5 = 0.20$. Thus, the percentage of those 25 people who were 
dissatisfied would almost surely be within 20% of the percentage of the entire class who 
were dissatisfied. If 60% of the sample said they were dissatisfied, you could tell the 
higher powers that somewhere between 40% and 80% of the entire class was proba- 
bly dissatisfied. Although that’s not a very precise statement, it is certainly enough to 
show major dissatisfaction. 

To collect your sample, you would proceed as follows: 
Step 1: Obtain a list of the students in the class, numbered from 1 to 200. 


Step 2: Obtain 25 random numbers between 1 and 200. One simple way to do this 
would be to write each of the numbers from 1 to 200 on equally sized slips of paper, 
put them in a bag, mix them very well, and draw out 25. However, we will instead use 
a computer program called Minitab to select the 25 numbers. Here is what the program 
and the results look like: 


MTB > set c1 
DATA > 1:200 
DATA > end 
MTB > sample 25 c1 c2 
MTB > print c2 

C2 
31 141 35 69 100 182 61 116 191 161 129 120 150 15 84 194 135 101 
44 163 152 39 99 110 36 


Step 3: Locate and interview the people on your list whose numbers were selected. No- 
tice that it is important to try to locate the actual 25 people resulting from this process. 
If you tried to phone someone only once and gave up when you could not reach that 
person, you would bias your results toward people who were home more often. If you 
collected your sample correctly, as described, you would have legitimate data to present 
to the higher powers. 
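
Readers without access to Minitab could carry out Step 2 in another language. The following Python sketch is one possible equivalent, not the method used in the example: it relies on the standard library's `random.sample`, which draws without replacement so that no student is chosen twice. The variable names are hypothetical, and because no seed is set, each run will produce a different set of 25 numbers.

```python
import random

class_list = list(range(1, 201))     # students numbered 1 to 200
rng = random.Random()                # unseeded: a fresh sample on every run
chosen = rng.sample(class_list, 25)  # 25 distinct numbers drawn without replacement
print(sorted(chosen))
```

Sorting the result simply makes it easier to look up the selected students on the numbered class list.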


4.5 Other Sampling Methods 


By now you may be asking yourself how polling organizations could possibly get a 
numbered list of all voters or of all adults in the country. In truth, they don’t. Instead, 
they rely on more complicated sampling methods. Here we describe a few other 
sampling methods, all of which are good substitutes for simple random sampling in 
most situations. In fact, they often have advantages over simple random sampling. 


Stratified Random Sampling 


Sometimes the population of units falls into natural groups, called strata. For ex- 
ample, public opinion pollsters often take separate samples from each region of the 
country so they can spot regional differences as well as measure national trends. Po- 
litical pollsters may sample separately from each political party to compare opinions 
by party. 

A stratified random sample is collected by first dividing the population of units 
into groups (strata) and then taking a simple random sample from each. For exam- 
ple, the strata might be regions of the country or political parties. You can often rec- 
ognize this type of sampling when you read the results of a survey because the 
results will be listed separately for each of the strata. Stratified sampling has other 
advantages besides the fact that results are available separately by strata. One is that 
different interviewers may work best with different people. For example, people 
from separate regions of the country (South, Northeast, and so on) may feel more 
comfortable with interviewers from the same region. It may also be more convenient 
to stratify before sampling. If we were interested in opinions of college students 
across the country, it would probably be easier to train interviewers at each college 
rather than to send the same interviewer to all campuses. 

So far we have been focusing on the collection of categorical variables, such as 
opinions or traits people might have. Surveys are also used to collect measurement 
variables, such as age at first intercourse or number of cigarettes smoked per day. We 
are often interested in the population average for such measurements. The accuracy 
with which we can estimate the average depends on the natural variability among the 
measurements. The less variable they are, the more precisely we can assess the pop- 
ulation average on the basis of the sample values. For instance, if everyone in a rel- 
atively large sample reports that his or her age at first intercourse was between 16 
years 3 months and 16 years 4 months, then we can be relatively sure that the aver- 
age age in the population is close to that. However, if reported ages range from 13 
years to 25 years, then we cannot pinpoint the average age for the population nearly 
as accurately. 

Stratified sampling can help to solve the problem of large natural variability. 
Suppose we could figure out how to stratify in a way that allowed little natural vari- 
ability in the answers within each of the strata. We could then get an accurate esti- 
mate for each stratum and combine estimates to get a much more precise answer for 
the group than if we measured everyone together. For example, if we wanted to es- 
timate the average weight gain of newborn babies during the first four days of life, 
we could do so more accurately by dividing the babies into groups based on their ini- 
tial birth weight. Very heavy newborns actually tend to lose weight during the first 
few days, whereas very light ones tend to gain more weight. 


Stratified sampling is sometimes used instead of simple random sampling for 
the following reasons: 
1. We can find individual estimates for each stratum. 


2. If the variable measured gives more consistent values within each of the 
strata than within the whole population, we can get more accurate esti- 
mates of the population values. 

3. If strata are geographically separated, it may be cheaper to sample them 
separately. 


4. We may want to use different interviewers within each of the strata. 
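
To make the second reason concrete, here is a small sketch of a stratified estimate for the newborn example. The weight changes are invented purely for illustration, and the function `stratified_mean` is a hypothetical helper, not a standard routine. It takes a simple random sample within each stratum and combines the stratum averages, weighting each by its share of the population.

```python
import random
import statistics

def stratified_mean(strata, n_per_stratum, rng):
    """Estimate the population mean from a simple random sample within each stratum.

    strata: dict mapping stratum name -> list of measurements for every unit in it.
    Returns (overall_estimate, per_stratum_estimates).
    """
    total_units = sum(len(units) for units in strata.values())
    per_stratum = {}
    overall = 0.0
    for name, units in strata.items():
        sample = rng.sample(units, n_per_stratum)   # simple random sample within the stratum
        per_stratum[name] = statistics.mean(sample)
        overall += per_stratum[name] * len(units) / total_units   # weight by stratum's share
    return overall, per_stratum

rng = random.Random(1)
# Hypothetical four-day weight changes in grams, made up for illustration only.
strata = {
    "light":  [rng.gauss(60, 10) for _ in range(300)],
    "medium": [rng.gauss(10, 10) for _ in range(500)],
    "heavy":  [rng.gauss(-40, 10) for _ in range(200)],
}
estimate, by_stratum = stratified_mean(strata, n_per_stratum=20, rng=rng)
true_mean = statistics.mean(x for units in strata.values() for x in units)
print(round(estimate, 1), round(true_mean, 1))
```

Because the made-up values vary little within each birth-weight group, 20 babies per stratum are enough to land close to the overall average.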


Cluster Sampling 


Cluster sampling is often confused with stratified sampling, but it is actually a rad- 
ically different concept and can be much easier to accomplish. The population units 
are again divided into groups, called clusters, but rather than sampling within each 
group, we select a random sample of clusters and measure only those clusters. One 
obvious advantage of cluster sampling is that you need only a list of clusters, instead 
of a list of all individual units. 

For example, suppose we wanted to sample students living in the dormitories at 
a college. If the college had 30 dorms and each dorm had 6 floors, we could consider 
the 180 floors to be 180 clusters of units. We could then randomly select the desired 
number of floors and measure everyone on those floors. Doing so would probably 
be much cheaper and more convenient than obtaining a simple random sample of all 
dormitory residents. 
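
The dormitory example might be sketched as follows. The roster of residents is invented for illustration, and the choice of 10 floors is an assumption, since the text says only "the desired number of floors."

```python
import random

rng = random.Random(7)

# 30 dorms x 6 floors = 180 clusters; each cluster is identified by (dorm, floor).
clusters = [(dorm, floor) for dorm in range(1, 31) for floor in range(1, 7)]

# Hypothetical roster: residents per floor are invented purely for illustration.
residents = {c: [f"dorm{c[0]}-floor{c[1]}-student{i}"
                 for i in range(1, rng.randint(15, 25) + 1)]
             for c in clusters}

chosen_floors = rng.sample(clusters, 10)                    # random sample of clusters, not of students
sample = [s for c in chosen_floors for s in residents[c]]   # measure everyone on those floors
print(len(clusters), len(chosen_floors), len(sample))
```

Notice that the random draw needs only the list of 180 floors. The names of individual students are looked up afterward, and only for the floors selected.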

If cluster sampling is used, the analysis must proceed differently because simi- 
larities may exist among the members of the clusters, and these must be taken into 
account. Numerous books are available that describe proper analysis methods based 
on which sampling plan was employed. (See, for example, Sampling Design and 
Analysis by Sharon Lohr, Duxbury Press, 1999.) 


Systematic Sampling 


Suppose you had a list of 5000 names and telephone numbers from which you 
wanted to select a sample of 100. That means you would want to select 1 of every 
50 people on the list. The first idea that might occur to you is to simply choose every 
50th name on the list. If you did so, you would be using a systematic sampling 
plan. With this plan, you divide the list into as many consecutive segments as you 
need, randomly choose a starting point in the first segment, then sample at that same 
point in each segment. In our example, you would randomly choose a starting point 
in the first 50 names, then sample every 50th name after that. When you were fin- 
ished, you would have selected one person from each of 100 segments, equally 
spaced throughout the list. 
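
A minimal sketch of this plan follows. The list of names is a placeholder, the helper `systematic_sample` is our own illustration, and the code assumes the list length is an exact multiple of the sample size, as it is in the example.

```python
import random

def systematic_sample(items, sample_size, rng):
    """Pick a random start in the first segment, then take the same position in every segment."""
    step = len(items) // sample_size     # segment length (50 when 5000 names yield 100)
    start = rng.randrange(step)          # random position within the first segment
    return [items[start + i * step] for i in range(sample_size)]

rng = random.Random(3)
names = [f"name{i}" for i in range(1, 5001)]   # placeholder list of 5000 names
sample = systematic_sample(names, 100, rng)
print(sample[:3], len(sample))
```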

Systematic sampling is often a good alternative to simple random sampling. In a 
few instances, however, it can lead to a biased sample, and common sense must be 
used to avoid those. As an example, suppose you were doing a survey of potential 
noise problems in a high-rise college dormitory. Further, suppose a list of residents 
was provided, arranged by room number, with 20 rooms per floor and two people 
per room. If you were to take a systematic sample of, say, every 40th person on the 
list, you would get people who lived in the same location on every floor—and thus 
a biased sampling of opinions about noise problems. 
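
The dormitory problem can be seen directly in a short sketch. The number of floors (10) is an assumption, since the example does not give one.

```python
import random

FLOORS, ROOMS_PER_FLOOR, PEOPLE_PER_ROOM = 10, 20, 2   # number of floors is an assumption

# Resident list arranged by room number: (floor, room, occupant).
residents = [(floor, room, person)
             for floor in range(1, FLOORS + 1)
             for room in range(1, ROOMS_PER_FLOOR + 1)
             for person in range(1, PEOPLE_PER_ROOM + 1)]

step = ROOMS_PER_FLOOR * PEOPLE_PER_ROOM   # 40 people per floor, so every 40th person
start = random.Random(5).randrange(step)
sample = residents[start::step]

rooms_sampled = {room for _, room, _ in sample}
print(sample)
print("distinct room positions sampled:", len(rooms_sampled))
```

Whatever the starting point, the sample contains one person per floor, all from the same room position. If that room happens to be next to the elevator or the stairwell, opinions about noise will not be typical.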


Random Digit Dialing 


Most of the national polling organizations in the United States now use a method of 
sampling called random digit dialing. This method results in a sample that approx- 
imates a simple random sample of all households in the United States that have tele- 
phones. The method proceeds as follows. First, they make a list of all possible 
telephone exchanges, where the exchange consists of the area code and the next three 
digits. Using numbers listed in the white pages, they can approximate the proportion 
of all households in the country that have each exchange. They then use a computer 
to generate a sample that has approximately those same proportions. Next, they use 
the same method to randomly sample banks within each exchange, where a bank 
consists of the next two numbers. Phone companies assign numbers using banks so 
that certain banks are mainly assigned to businesses, certain ones are held for future 
neighborhoods, and so on. Finally, to complete the number, the computer randomly 
generates two digits from 00 to 99. 
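
The three stages can be sketched roughly as follows. Everything specific here is hypothetical: the exchanges, the banks, and their proportions are invented, and real polling organizations work from far larger tables. The sketch only shows the order of the steps.

```python
import random

rng = random.Random(11)

# Hypothetical shares of households per exchange (area code + next three digits).
exchange_share = {"916-555": 0.5, "530-555": 0.3, "209-555": 0.2}
# Hypothetical shares of residential numbers per two-digit bank within each exchange.
bank_share = {
    "916-555": {"01": 0.7, "02": 0.3},
    "530-555": {"10": 1.0},
    "209-555": {"40": 0.6, "41": 0.4},
}

def random_digit_dial(rng):
    """Build one phone number: sampled exchange, sampled bank, two random final digits."""
    exchange = rng.choices(list(exchange_share), weights=list(exchange_share.values()))[0]
    banks = bank_share[exchange]
    bank = rng.choices(list(banks), weights=list(banks.values()))[0]
    last_two = f"{rng.randrange(100):02d}"   # 00 through 99, each equally likely
    return f"{exchange}-{bank}{last_two}"

numbers = [random_digit_dial(rng) for _ in range(5)]
print(numbers)
```

Because the final two digits are generated rather than looked up, unlisted numbers have the same chance of being dialed as listed ones.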

Once a phone number has been determined, a well-conducted poll will make 
multiple attempts to reach someone at that household. Sometimes the interviewers 
will ask to speak to a male because females are more likely to answer the phone 
and would thus be overrepresented. 


EXAMPLE 5 Finding Teens and Parents Willing to Talk 


The survey described in News Story 13 in the Appendix was conducted by telephone, and 
Original Source 13 on the CD describes in detail how the sample was obtained. The re- 
searchers started with an “initial pool of random telephone numbers” consisting of 
94,184 numbers, which “represented all 48 continental states in proportion to their 
population, and were prescreened by computer to eliminate as many unassigned or 
nonresidential telephone numbers as possible” (p. 29). 
Despite the prescreening, the initial pool of 94,184 numbers eventually resulted in 
only 1987 completed interviews! There is a detailed table of why this is the case on page 
31 of the report. For instance, 12,985 of the numbers were “not in service.” Another 
25,471 were ineligible because there was no resident in the required age group, 12 to 
17 years old. Another 27,931 refused to provide the information that was required to 
know whether the household qualified. Only 8597 were abandoned because of no an- 
swer, partly because “at least four call back attempts were made to each telephone 
number before the telephone number was rejected” (p. 29). 
An important question is whether any of the reasons for exclusion were likely to in- 
troduce significant bias in the results. The report does address this question with respect 
to one reason, refusal on the part of a parent to allow the teen to participate: 


While the refusal rate of parents, having occurred in 544 cases, seems modest, 
this represents the loss of 11 percent of other eligible households, which is sub- 
stantial enough to have an impact on the achieved sample. This may be a con- 
tributing factor to the understatement of substance use rates, and to the 
underrepresentation of racial and ethnic populations. (p. 30) 
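
To get a feel for how quickly the initial pool shrank, it helps to express the counts quoted in this example as shares of the 94,184 numbers. The sketch below uses only the figures given above. The remainder is simply whatever is left over, which the example attributes to other reasons listed in the report's table.

```python
initial_pool = 94_184
outcomes = {
    "not in service": 12_985,
    "no resident aged 12 to 17": 25_471,
    "refused screening information": 27_931,
    "no answer after call-backs": 8_597,
    "completed interviews": 1_987,
}

shares = {label: 100 * count / initial_pool for label, count in outcomes.items()}
for label, pct in shares.items():
    print(f"{label:32s} {outcomes[label]:>6,d}  {pct:5.1f}%")

unlisted_here = initial_pool - sum(outcomes.values())   # other reasons in the report's table
print(f"numbers with other outcomes: {unlisted_here:,d}")
```

Only about 2 of every 100 numbers in the pool ended in a completed interview.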


Multistage Sampling 


Many large surveys, especially those that are conducted in person rather than over 
the telephone, use a combination of the methods we have discussed. They might 
stratify by region of the country; then stratify by urban, suburban, and rural; and then 
choose a random sample of communities within those strata. They would then divide 
those communities into city blocks or fixed areas, as clusters, and sample some of 
those. Everyone on the block or within the fixed area may then be sampled. This is 
called a multistage sampling plan. 
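
One way to picture such a plan is as a set of nested selections. In this sketch the regions, communities, blocks, and the number chosen at each stage are all invented for illustration; an actual survey would use its own frames and sample sizes.

```python
import random

rng = random.Random(21)
regions = ["Northeast", "South", "Midwest", "West"]   # illustrative strata
settings = ["urban", "suburban", "rural"]

def communities_in(region, setting):
    """Hypothetical frame: eight made-up communities per region-by-setting stratum."""
    return [f"{region}-{setting}-community{i}" for i in range(1, 9)]

def blocks_in(community):
    """Hypothetical frame: twelve made-up blocks (clusters) per community."""
    return [f"{community}-block{j}" for j in range(1, 13)]

selected_blocks = []
for region in regions:                      # stage 1: stratify by region
    for setting in settings:                # stage 2: stratify by urban/suburban/rural
        for community in rng.sample(communities_in(region, setting), 2):   # stage 3
            selected_blocks += rng.sample(blocks_in(community), 3)         # stage 4: clusters
# Final stage: everyone living on each selected block would be interviewed.
print(len(selected_blocks), selected_blocks[0])
```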
4.6 Difficulties and Disasters in Sampling 


Difficulties 

1. Using the wrong sampling frame 
2. Not reaching the individuals selected 
3. Having a low response rate 

Disasters 

1. Getting a volunteer or self-selected sample 
2. Using a convenience or haphazard sample 


In theory, designing a good sampling plan is easy and straightforward. However, the 
real world rarely cooperates with well-designed plans, and trying to collect a proper 
sample is no exception. Difficulties that can occur in practice need to be considered 
when you evaluate a study. If a proper sampling plan is never implemented, the con- 
clusions can be misleading and inaccurate. 


Difficulties in Sampling 


Following are some problems that can occur even when a sampling plan has been 
well designed. 


Using the Wrong Sampling Frame 

Remember that the sampling frame is the list of the population of units from which 
the sample is drawn. Sometimes a sampling frame either will include unwanted units 
or exclude desired units. For example, using a list of registered voters to predict elec- 
tion outcomes includes those who are not likely to vote as well as those who are 
likely to do so. Using a telephone directory to survey the general population ex- 
cludes those who move often, those with unlisted home numbers (such as many 
physicians and teachers), and those who cannot afford a telephone. 

Common sense can often lead to a solution for this problem. In the example of 
registered voters, interviewers may try to first ascertain the voting history of the per- 
son contacted by asking where he or she votes and then continuing the interview 
only if the person knows the answer. Instead of using a telephone directory, surveys 
use random digit dialing. This solution still excludes those without phones but not 
those who didn’t happen to be in the last printed directory. 


Not Reaching the Individuals Selected 
Even if a proper sample of units is selected, the units may not be reached. For 
example, Consumer Reports magazine mails a lengthy survey to its subscribers to 
obtain information on the reliability of various products. If you were to receive such 
a survey, and you had a close friend who had been having trouble with a highly rated 
automobile, you may very well decide to pass the questionnaire on to your friend to 
answer. That way, he would get to register his complaints about the car, but Con- 
sumer Reports would not have reached the intended recipient. 

Telephone surveys tend to reach a disproportionate number of women because 
they are more likely to answer the phone. To try to counter that problem, researchers 
sometimes ask to speak to the oldest adult male at home. Surveys are also likely to 
have trouble contacting people who work long hours and are rarely home or those 
who tend to travel extensively. 

In recent years, news organizations have been pressured to produce surveys of 
public opinion quickly. When a controversial story breaks, people want to know how 
others feel about it. This pressure results in what Wall Street Journal reporter Cyn- 
thia Crossen calls “quickie polls.” As she notes, these are “most likely to be wrong 
because questions are hastily drawn and poorly pretested, and it is almost impossi- 
ble to get a random sample in one night” (Crossen, 1994, p. 102). Even with the 
computer randomly generating phone numbers for the sample, many people are not 
likely to be home that night—and they may have different opinions from those who 
are likely to be home. Most responsible reports about polls include information 
about the dates during which they were conducted. If a poll was done in one night, 
beware! It is important that once a sample has been selected, those individuals are 
the ones who are actually measured. It is better to put resources into getting a smaller 
sample than to get one that has been biased because the survey takers moved on to 
the next person on the list when a selected individual was initially unavailable. 


Having a Low Response Rate 

Even the best surveys are not able to contact everyone on their list, and not everyone 
contacted will respond. The General Social Survey (GSS), run by the prestigious Na- 
tional Opinion Research Center (NORC) at the University of Chicago, noted in its 
September 1993 GSS News: 


In 1993 the GSS achieved its highest response rate ever, 82.4%. This is five 
percentage points higher than our average over the last four years. Given the 
long length of the GSS (90 minutes), the high response rates of the GSS are 
testimony to the extraordinary skill and dedication of the NORC field staff. 


Beyond having a dedicated staff, not much can be done about getting everyone in the 
sample to respond. Response rates should simply be reported in research summaries. 
As a reader, remember that the lower the response rate, the less the results can be 
generalized to the population as a whole. Responding to a survey (or not) is volun- 
tary, and those who respond are likely to have stronger opinions than those who 
do not. 

With mail surveys, it may be possible to compare those who respond immedi- 
ately with those who need a second prodding, and in telephone surveys you could 
compare those who are home on the first try with those who require numerous call- 
backs. If those groups differ on the measurement of interest, then those who were 
never reached are probably different as well. 


In a mail survey, it is best not to rely solely on “volunteer response.” In other 
words, don’t just accept that those who did not respond the first time can’t be cajoled 
into it. Often, sending a reminder with a brightly colored stamp or following up with 
a personal phone call will produce the desired effect. Surveys that simply use those 
who respond voluntarily are sure to be biased in favor of those with strong opinions 
or with time on their hands. 


EXAMPLE 6 Which Scientists Trashed the Public? 

According to a poll taken among scientists and reported in the prestigious journal 
Science (Mervis, 1998), scientists don’t have much faith in either the public or the me- 
dia. The article reported that, based on the results of a “recent survey of 1400 profes- 
sionals” in science and in journalism, 82% of scientists “strongly or somewhat agree” 
with the statement “the U.S. public is gullible and believes in miracle cures or easy so- 
lutions,” and 80% agreed that “the public doesn’t understand the importance of fed- 
eral funding for research.” About the same percentage (82%) also trashed the media, 
agreeing with the statement “the media do not understand statistics well enough to ex- 
plain new findings.” It isn’t until the end of the article that we learn who responded: 
“The study reported a 34% response rate among scientists, and the typical respondent 
was a white, male physical scientist over the age of 50 doing basic research.” Remem- 
ber that those who feel strongly about the issues in a survey are the most likely to re- 
spond. With only about a third of those contacted responding, it is inappropriate to 
generalize these findings and conclude that most scientists have so little faith in the pub- 
lic and the media. This is especially true because we were told that the respondents rep- 
resented only a narrow subset of scientists. 


Disasters in Sampling 


A few sampling methods are so bad that they don’t even warrant a further look at the 
study or its results. 


Getting a Volunteer or Self-Selected Sample 

Although relying on volunteer responses makes it somewhat difficult to deter- 
mine the extent to which surveys can be generalized, relying on a volunteer sam- 
ple is a complete waste of time. If a magazine, Web site, or television station runs a 
survey and asks any readers or viewers who are interested to respond, the results re- 
flect only the opinions of those who decide to volunteer. As noted earlier, those who 
have a strong opinion about the question are more likely to respond than those who 
do not. Thus, the responding group is simply not representative of any larger group. 
Most media outlets now acknowledge that such polls are “unscientific” when they 
report the results, but most readers are not likely to understand how misleading the 
results can be. The next example illustrates the contradiction that can result between 
a scientific poll and one relying solely on a volunteer sample. 


EXAMPLE 7 A Meaningless Poll 


On February 18, 1993, shortly after Bill Clinton became president of the United States, 
a television station in Sacramento, California, asked viewers to respond to the question: 
“Do you support the president’s economic plan?” The next day, the results of a properly 
conducted study asking the same question were published in the newspaper. Here are 
the results: 


                           Television Poll       Survey 
                           (Volunteer sample)    (Random sample) 
Yes (support plan)               42%                 75% 
No (don’t support plan)          58%                 18% 
Not sure                          0%                  7% 


As you can see, those who were dissatisfied with the president's plan were much more 
likely to respond to the television poll than those who supported it, and no one who 
was “Not sure” called the television station because they were not invited to do so. Try- 
ing to extend those results to the general population is misleading. It is irresponsible to 
publicize such studies, especially without a warning that they result from an unscien- 
tific survey and are not representative of general public opinion. You should never in- 
terpret such polls as anything other than a count of who bothered to go to the 
telephone and call. 
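
A little arithmetic shows how lopsided the calling must have been. The sketch below assumes, purely for illustration, that the random-sample survey reflects the true split of opinion among viewers, and it ignores the “Not sure” group because the television poll offered no such option. Under those assumptions it asks how much more eager opponents must have been to call than supporters.

```python
# Random-sample survey, treated here as the population shares (an assumption).
pop_yes, pop_no = 0.75, 0.18
# Volunteer television poll.
tv_yes, tv_no = 0.42, 0.58

# If callers respond with probabilities r_yes and r_no, the poll shows
#   tv_yes / tv_no = (pop_yes * r_yes) / (pop_no * r_no),
# so the relative eagerness of opponents, r_no / r_yes, is:
ratio_no_to_yes = (pop_yes / pop_no) * (tv_no / tv_yes)
print(f"opponents were about {ratio_no_to_yes:.1f} times as likely to call")
```

Under these assumptions, someone who opposed the plan would have been nearly six times as likely to pick up the phone as someone who supported it.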


Using a Convenience or Haphazard Sample 

Another sampling technique that can produce misleading results for surveys is to use 
the most convenient group available or to decide on the spot whom to sample. In most 
cases, the group is not likely to represent any larger population for the information 
measured. In some cases, the respondents may be similar enough to a population of 
interest that the results can be extended, but extreme caution should be used in de- 
ciding whether this is likely to be so. For example, students in introductory psy- 
chology or statistics classes may be representative of all students at a university on 
issues like extent of drug use in the high school they attended, but not on issues like 
how many hours they study each week. 


Haphazard Sampling 


A few years ago, the student newspaper at a California university announced as a front 
page headline: “Students ignorant, survey says.” The article explained that a “random 
survey” indicated that American students were less aware of current events than inter- 
national students were. However, the article quoted the undergraduate researchers, 
who were international students themselves, as saying that “the students were ran- 
domly sampled on the quad.” The quad is an open-air, grassy area where students re- 
lax, eat lunch, and so on. There is simply no proper way to collect a random sample of 
students by selecting them in an area like that. In such situations, the researchers are 
likely to approach people who they think will support the results they intended for their 
survey. Or, they are likely to approach friendly looking people who appear as though 
they will easily cooperate. This is called a haphazard sample, and it cannot be expected 
to be representative at all. 


You have seen the proper way to collect a sample and have been warned about 
the many difficulties and dangers inherent in the process. We finish the chapter with 
a famous example that helped researchers learn some of these pitfalls. 


CASE STUDY 4.1 
The Infamous Literary Digest Poll of 1936 


Before the election of 1936, a contest between Democratic incumbent Franklin De- 
lano Roosevelt and Republican Alf Landon, the magazine Literary Digest had been 
extremely successful in predicting the results in U.S. presidential elections. But 1936 
turned out to be the year of its downfall, when it predicted a 3-to-2 victory for Lan- 
don. To add insult to injury, young pollster George Gallup, who had just founded the 
American Institute of Public Opinion in 1935, not only correctly predicted Roosevelt 
as the winner of the election, he also predicted that the Literary Digest would get it 
wrong. He did this before the magazine even conducted its poll. And Gallup sur- 
veyed only 50,000 people, whereas the Literary Digest sent questionnaires to 10 mil- 
lion people (Freedman, Pisani, Purves, and Adhikari, 1991, p. 307). 

The Literary Digest made two classic mistakes. First, the lists of people to whom 
it mailed the 10 million questionnaires were taken from magazine subscribers, car 
owners, telephone directories, and, in just a few cases, lists of registered voters. In 
1936, those who owned telephones or cars, or subscribed to magazines, were more 
likely to be wealthy individuals who were not happy with the Democratic incum- 
bent. The sampling frame did not match the population of interest. 

Despite what many accounts of this famous story conclude, the bias produced by 
the more affluent list was not likely to have been as severe as the second problem 
(Bryson, 1976). The main problem was a low response rate. The magazine received 
2.3 million responses, a response rate of only 23%. Those who felt strongly about 
the outcome of the election were most likely to respond. And that included a major- 
ity of those who wanted a change, the Landon supporters. Those who were happy 
with the incumbent were less likely to bother to respond. 

Gallup, however, knew the value of random sampling. He was able not only to 
predict the election but to predict the results of the Literary Digest poll within 1%. 
How did he do this? According to Freedman and colleagues (1991, p. 308), “he just 
chose 3000 people at random from the same lists the Digest was going to use, and 
mailed them all a postcard asking them how they planned to vote.” This example il- 
lustrates the beauty of random sampling and the idiocy of trying to base conclusions 
on nonrandom and biased samples. The Literary Digest went bankrupt the following 
year, and so never had a chance to revise its methods. The organization founded by 
George Gallup has flourished, although not without making a few sampling blunders 
of its own (see, for example, Exercise 11). 
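
The sample sizes in this case study make the point numerically. The sketch below applies the chapter's approximate margin of error rule (1 divided by the square root of the sample size) to each of them. Bear in mind that the rule describes random samples only, so the figure for the Literary Digest shows what its 2.3 million responses would have delivered had they been a random sample, which they were not.

```python
import math

samples = {
    "Gallup's postcard sample": 3_000,
    "Gallup's election survey": 50_000,
    "Literary Digest responses": 2_300_000,
}
for label, n in samples.items():
    moe = 100 / math.sqrt(n)     # approximate margin of error in percentage points
    print(f"{label:28s} n = {n:>9,d}   margin of error about {moe:.2f} points")

response_rate = 100 * 2_300_000 / 10_000_000
print(f"Literary Digest response rate: {response_rate:.0f}%")
```

Sampling variability was therefore never the magazine's problem. A huge biased sample is still biased, and no amount of additional responses would have repaired it.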


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


*1. For each of the following situations, state which type of sampling plan was used. 
Explain whether you think the sampling plan would result in a biased sample. 
*a. To survey the opinions of its customers, an airline company made a list of all 
its flights and randomly selected 25 flights. All of the passengers on those 
flights were asked to fill out a survey. 
*b. A pollster interested in opinions on gun control divided a city into city 
blocks, then surveyed the third house to the west of the southeast corner of 
each block. If the house was divided into apartments, the westernmost 
ground floor apartment was selected. The pollster conducted the survey dur- 
ing the day, but left a notice for those who were not at home to phone her so 
she could interview them. 

*c. To learn how its employees felt about higher student fees imposed by the leg- 
islature, a university divided employees into three categories: staff, faculty, 
and student employees. A random sample was selected from each group and 
they were telephoned and asked for their opinions. 

*d. A large variety store wanted to know if consumers would be willing to pay 
slightly higher prices to have computers available throughout the store to 
help them locate items. The store posted an interviewer at the door and told 
her to collect a sample of 100 opinions by asking the next person who came 
in the door each time she had finished an interview. 

2. Explain the difference between a proportion and a percentage as used to present 
the results of a sample survey. Include an explanation of how you would convert 
results from one form to the other. 

3. Construct an example in which a systematic sampling plan would result in a bi- 
ased sample. 

*4. In the March 8, 1994, edition of the Scotsman, a newspaper published in Edin- 
burgh, Scotland, a headline read, “Reform study finds fear over schools.” The 
article described a survey of 200 parents who had been asked about proposed 
education reforms and indicated that most parents felt uninformed and thought 
the reforms would be costly and unnecessary. The report did not clarify whether 
a random sample was chosen, but make that assumption in answering the fol- 
lowing questions. 

*a. What is the margin of error for this survey? 

*b. It was reported that “about 80 percent added that they were satisfied with the 
current education set-up in Scotland.” What is the range of values that al- 
most certainly covers the percentage of the population of parents who were 
satisfied? 

c. The article quoted Lord James Douglas-Hamilton, the Scottish education 
minister, as saying, “If you took a similar poll in two years’ time, you would 
have a different result.” Comment on this statement. 

*5. An article in the Sacramento Bee (12 January 1998, p. A4) was titled “College 
freshmen show conservative side” and reported the results of a fall 1997 survey 
“based on responses from a representative sample of 252,082 full-time freshmen 
at 464 two- and four-year colleges and universities nationwide.” The article did 
not explain how the schools or students were selected. 

a. For this survey, explain what a unit is, what the population is, and what the 
sample is. 

b. Assuming a random sample of students was selected at each of the 464 
schools, what type of sample was used in this survey? Explain. 

*c. Now assume that the 464 schools were randomly selected from all eligible 
colleges and universities and that all first-year students at those schools were 
surveyed. Explain what type of sample was used in the survey. 

d. Why would one of the two sampling methods described in parts b and c have 
been simpler to implement than a simple random sample of all first-year col- 
lege students in the United States? 

6. The survey in Exercise 5 has been conducted annually by the Higher Education 
Research Institute at UCLA since 1966. One of the results reported was that 
“students’ disengagement from politics continues. The percentage of freshmen 
believing that ‘keeping up to date with political affairs’ is important fell to 26.7 
percent, down from 29.4 percent a year ago [in 1996] and a high of 57.8 percent 
in 1966.” In 1966, college students were in the midst of protesting the Vietnam 
War, and in 1996 there was a presidential election. Do you think the results of 
this survey indicate that first-year college students have become more apathetic 
in general? Explain. 

7. Specify the population and the sample, being sure to include both units and mea- 
surements, for the situation described in 
a. Exercise 1a 
b. Exercise 1b 
c. Exercise 1c 
d. Exercise 1d 

8. Give an example in which 
a. A sample would be preferable to a census 
b. A cluster sample would be the easiest method to use 
c. A systematic sample would be the easiest to use and would not be biased 

*9. Explain whether a survey or a randomized experiment would be most appropri- 
ate to find out about each of the following: 
*a. Who is likely to win the next presidential election 
*b. Whether the use of nicotine gum reduces cigarette smoking 
c. Whether there is a relationship between height and happiness 
d. Whether a public service advertising campaign has been effective in pro- 
moting the use of condoms 

10. Find a news article describing a survey that is obviously biased. Explain why 
you think it is biased. 

11. Despite his success in 1936, George Gallup failed miserably in trying to predict 
the winner of the 1948 U.S. presidential election. His organization, as well as 
two others, predicted that Thomas Dewey would beat incumbent Harry Truman. 
All three used what is called “quota sampling.” The interviewers were told to 
find a certain number, or quota, of each of several types of people. For example, 
they might have been told to interview six women under age 40, one of whom 
was black and the other five of whom were white. Imagine that you are one of 
their interviewers trying to follow these instructions. Who would you ask? Now 
explain why you think these polls failed to predict the true winner and why 
quota sampling is not a good method. 

*12. Explain the difference between a low response rate and a volunteer sample. Ex- 
plain which is worse, and why. 

13. Explain why the main problem with the Literary Digest poll is described as “low 
response rate” and not “volunteer sample.” 

14. Gastwirth (1988, p. 507) describes a court case in which Bristol-Myers was or- 
dered by the Federal Trade Commission to stop advertising that “twice as many 
dentists use Ipana [toothpaste] as any other dentifrice” and that more dentists 
recommended it than any other dentifrice. Bristol-Myers had based its claim on 
a survey of 10,000 randomly selected dentists from a list of 66,000 subscribers 
to two dental magazines. They received 1983 responses, with 621 saying they 
used Ipana and only 258 reporting that they used the second most popular brand. 
As for the recommendations, 461 respondents recommended Ipana, compared 
with 195 for the second most popular choice. 

a. Specify the sampling frame for this survey, and explain whether you think 
“using the wrong sampling frame” was a difficulty here, based on what 
Bristol-Myers was trying to conclude. 

b. Of the remaining four “difficulties and disasters in sampling” listed in Sec- 
tion 4.6 (other than “using the wrong sampling frame”), which do you think 
was the most serious in this case? Explain. 

c. What could Bristol-Myers have done to improve the validity of the results af- 
ter it had mailed the 10,000 surveys and received 1983 back? Assume the 
company kept track of who had responded and who had not. 

*15. A survey in Newsweek (14 November 1994, p. 54) asked: “Does the Senate gen- 
erally pay too much attention to personal lives of people nominated to high of- 
fice, or not enough?” Fifty-six percent of the respondents said “too much 
attention.” It was also reported that “for this Newsweek poll, Princeton Survey 
Research Associates telephoned 756 adults Nov. 3-4. The margin of error is ±4 
percentage points.” 

a. Verify that the margin of error reported by Newsweek is consistent with the 
rule given in this chapter for finding the approximate margin of error. 

*b. Based on these sample results, are you convinced that a majority of the pop- 
ulation (that is, over 50%) think that the Senate pays too much attention? 
Explain. 

16. The student newspaper at a university in California reported a debate between 
two student council members, revolving around a survey of students (California 
Aggie, 8 November 1994, p. 3). The newspaper reported that “according to an 
AS [Associated Students] Survey Unit poll, 52 percent of the students surveyed 
said they opposed a diversity requirement.” The report said that one council 
member “claimed that the roughly 500 people polled were not enough to guar- 
antee a statistically sound cross section of the student population.” Another 
council member countered by saying that “three percent is an excellent random 


sampling, so there’s no reason to question accuracy.” (Note that the 3% figure is
based on the fact that there were about 17,000 undergraduate students currently
enrolled at that time.)

a. Comment on the remark attributed to the first council member, that the
sample size is not large enough to “guarantee a statistically sound cross section
of the population.” Is the size of the sample the relevant issue to address his
concern?

b. Comment on the remark by the second council member that “three percent
is an excellent random sampling, so there’s no reason to question accuracy.”
Is she correct in her use of terminology and in her conclusion?

c. Assuming a random sample was used, produce an interval that almost
certainly covers the true percentage of the population of students who oppose
the diversity requirement. Use your result to comment on the debate. In
particular, do these results allow a conclusion as to whether the majority of
students on campus oppose the requirement?

*17. Identify each of the following studies as a survey, an experiment, an
observational study, or a case study. Explain your reasoning.

*a. A doctor claims to be able to cure migraine headaches. A researcher
administers a questionnaire to each of the patients the doctor claims to have cured.

b. Patients who visit a clinic to help them stop smoking are given a choice of
two treatments: undergoing hypnosis or applying nicotine patches. The
percentages who quit are compared for the two methods.

c. A large company wants to compare two incentive plans for increasing sales.
The company randomly assigns a number of its sales staff to receive each
kind of incentive and compares the average change in sales of the employees
under the two plans.

*18. Is using a convenience sample an example of a probability sampling plan?
Explain why or why not.

19. What role does natural variability play when trying to determine the population
average of a measurement variable from a sample?

20. Suppose that a gourmet food magazine wants to know how its readers feel about
serving beer with various types of food. The magazine sends surveys to 1000
randomly selected readers. Explain which one of the “difficulties and disasters”
in sampling the magazine is most likely to face.

21. Suppose you had a student telephone directory for your local college and wanted
to sample 100 students. Explain how you would obtain each of the following:

a. A simple random sample

b. A systematic sample

*22. Suppose you have a telephone directory for your local college from which you
randomly select 100 names. To find out how students feel about a new pub on
campus, you call the 100 numbers and interview the person who answers the
phone. Explain which one of the “difficulties and disasters” in sampling you are
most likely to encounter and how it could bias your results.

23. The U.S. government uses a multitude of surveys to measure opinions,
behaviors, and so on. Yet, every 10 years it takes a census. What can the government
learn from a census that it could not learn from a sample survey?

24. The Sacramento Bee (11 Feb. 2001, p. A20) reported on a Newsweek poll that
was based on interviews with 1000 adults, asking questions about a variety of
issues.

a. What is the margin of error for this poll?

b. One of the statements in the news story was “a margin of error of plus or
minus three percentage points means that the 43 percent of Americans for
and the 48 percent of Americans against oil exploration in Alaska’s Arctic
National Wildlife Refuge are in a statistical dead heat.” Explain what is
meant by this statement.

25. In early September 2003, California’s Governor Gray Davis approved a
controversial law allowing people who were not legal residents to obtain a California
state driver’s license. That week the California Field Poll released a survey
showing that 59% of registered voters opposed the law and 34% supported it.
This part of the survey was based on a random sample of just over 300 people.

a. What is the approximate margin of error for the Field Poll results?

b. Provide an interval that is likely to cover the true percentage of registered
California voters who supported the law.

*26. Refer to the previous exercise. The same week that the Field Poll was released
a Web site called SFGate.com (http://www.sfgate.com/polls/) asked visitors to
“Click to vote” on their preferred response to “Agree with new law allowing
drivers’ licenses for illegal immigrants?” The choices and the percent who chose
them were “Yes, gesture of respect, makes roads safe” 19%, “No, thwarts
immigration law, poses security risk” 79%, and “Oh, great, another messy ballot
battle” 2%. The total number of votes shown was 2900.

a. What type of sample was used for this poll?

b. Explain likely reasons why the percent who supported the law in this poll
(19%) differed so much from the percent who supported it in the Field Poll
(34%).

*c. Which result, the one from the SFGate poll or from the Field Poll, do you
think was more likely to represent the opinion of the population of registered
California voters at that time? Explain.

27. Each of the following quotes is based on the results of an experiment or an
observational study. Explain which was used. If an observational study was used,
explain whether an experiment could have been used to study the topic
instead.

a. “A recent Stanford study of more than 6000 men found that tolerance for
exercise (tested on a treadmill) was a stronger predictor of risk of death than
high blood pressure, smoking, diabetes, high cholesterol and heart disease”
(Kalb, 2003, p. 64).


b. “On-the-job criticism may hurt your back as well as your feelings, re- 
searchers report. [They] evaluated 25 college student volunteers. The stu- 
dents, wearing a device that monitors motion and measures stresses on the 
spine, were asked to lift a 25-pound box under different emotional
circumstances” (Reuters Health, Dec 2, 2000;
http://dailynews.yahoo.com/h/nm/20001202/h1/work_1.html).


For Exercises 28 to 30, locate the News Story in the Appendix and Original Source 
on the CD. In each case, consult the Original Source and then explain what type of 
sample was used. Then discuss whether you think the results can be applied to any 
larger population on the basis of the type of sample used. 


*28. Original Source 2: “Development and initial validation of the Hangover
Symptoms Scale: Prevalence and correlates of hangover symptoms in college
students.”

29. Original Source 7: “Auto body repair inspection pilot program: Report to the
legislature.”

30. Original Source 10: “Religious attendance and cause of death over 31 years.”


Mini-Projects 


1. For this project, you will use the telephone directory for your community to
estimate the percentage of households that list their phone number but not their
address. Use two different sampling methods, chosen from simple random
sampling, stratified sampling, cluster sampling, or systematic sampling. In each
case, sample about 100 households and figure out the proportion of those who
do not list an address.

a. Explain exactly how you chose your samples.

b. Explain which of your two methods was easier to use.

c. Do you think either of your methods produced biased results? Explain.

d. Report your results, including a margin of error. Are your estimates from the
two methods in agreement with each other?


2. Go to a large parking lot or a large area where bicycles are parked. Choose a
color or a manufacturer. Design a sampling scheme you can use to estimate the
percentage of cars or bicycles of that color or model. In choosing the number to
sample, consider the margin of error that will accompany your sample result.
Now go through the entire area, actually taking a census, and compute the
population percentage of cars or bicycles of that type.

a. Explain your sampling method and discuss any problems or biases you
encountered in using it.

b. Construct an interval from your sample that almost surely covers the true
population percentage with that characteristic. Does your interval cover the
true population percentage you found when you took the census?

c. Use your experience with taking the census to name one practical difficulty
with taking a census. (Hint: Did all the cars or bicycles stay put while you
counted?)


CHAPTER 5


Experiments and Observational Studies


Thought Questions 


1. In conducting a study to relate two conditions (activities, traits, and so on),
researchers often define one of them as the explanatory variable and the other as the
outcome or response variable. In a study to determine whether surgery or
chemotherapy results in higher survival rates for a certain type of cancer, whether the
patient survived is one variable, and whether the patient received surgery or
chemotherapy is the other. Which is the explanatory variable and which is the
response variable?

2. In an experiment, researchers assign “treatments” to participants, whereas in an
observational study, they simply observe what the participants do naturally. Give an
example of a situation where an experiment would not be feasible for ethical reasons.

3. Suppose you are interested in determining whether a daily dose of vitamin C helps
prevent colds. You recruit 20 volunteers to participate in an experiment. You want
half of them to take vitamin C and the other half to agree not to take it. You ask
them each which they would prefer, and 10 say they would like to take the vitamin
and the other 10 say they would not. You ask them to record how many colds they
get during the next 10 weeks. At the end of that time, you compare the results
reported from the two groups. Give three reasons why this is not a good experiment.

4. When experimenters want to compare two treatments, such as an old and a new
drug, they use randomization to assign the participants to the two conditions. If you
had 50 people participate in such a study, how would you go about randomizing
them? Why do you think randomization is necessary? Why shouldn’t the
experimenter decide which people should get which treatment?

5. “Graduating is good for your health,” according to a headline in the Boston Globe
(3 April 1998, p. A25). The article noted, “According to the Center for Disease
Control, college graduates feel better emotionally and physically than do high school
dropouts.” Do you think the headline is justified based on this statement? Explain
why or why not.


5.1 Defining a Common Language


In this chapter, we focus on studies that attempt to detect relationships between vari- 
ables. In addition to the examples seen in earlier chapters, some of the connections 
we examine in this chapter and the next include a relationship between baldness and 
heart attacks in men, between smoking during pregnancy and subsequent lower IQ 
in the child, between listening to Mozart and scoring higher on an IQ test, and be- 
tween handedness and age at death. We will see that some of these connections are 
supported by properly conducted studies, whereas other connections are not as solid. 


Explanatory Variables, Response Variables, 
and Treatments 


In most studies, we imagine that if there is a causal relationship, it occurs in a par- 
ticular direction. For example, if we found that left-handed people die at a younger 
age than right-handed people, we could envision reasons why their handedness 
might be responsible for the earlier death, such as accidents resulting from living in 
a right-handed world. It would be more difficult to argue that they were left-handed 
because they were going to die at an earlier age. 


Explanatory Variables versus Response Variables 

We define an explanatory variable to be one that attempts to explain or is purported 
to cause (at least partially) differences in a response variable (sometimes called an 
outcome variable). In the previous example, handedness would be the explanatory 
variable and age at death the response variable. In the Salk experiment described in 
Chapter 1, whether the baby listened to a heartbeat was the explanatory variable and 
weight gain was the response variable. In a study comparing chemotherapy to 
surgery for cancer, the medical treatment is the explanatory variable and surviving 
(usually measured as surviving for 5 years) or not surviving is the response variable. 
Many studies have more than one explanatory variable for each response variable, 
and there may be multiple response variables. The goal is to relate one or more ex- 
planatory variables to each response variable. 

Usually we can distinguish which variable is which, but occasionally we exam- 
ine relationships in which there is no conceivable causal connection. An example is 
the apparent relationship between baldness and heart attacks. Because the level of 
baldness was measured at the time of the heart attack, the heart attack could not have 
caused the baldness. It would be farfetched to assume that baldness results in such 
stress that men are led to have heart attacks. Instead, a third variable may be causing 
both the baldness and the heart attack. In such cases, we simply refer to the variables 
generically and do not assign one to be the explanatory variable and one to be the re- 
sponse variable. 


Treatments 
Sometimes the explanatory variable takes the form of a manipulation applied by the
experimenter, such as when Salk played the sound of a heartbeat for some of the
babies. A treatment is one or a combination of categories of the explanatory
variable(s) assigned by the experimenter. The plural term treatments incorporates
a collection of conditions, each of which is one treatment. In Salk’s experiment
there were two treatments: Some babies received the heartbeat treatment and
others received the silent treatment.

For the study described in News Story 1 in the Appendix, some participants were 
assigned to follow an 8-week meditation regime and the others were not. The two 
treatments were “meditation routine” and “control,” where the control group was
measured for the response variables at the same times as the meditation group. Re- 
sponse variables were brain electrical activity and immune system functioning. The 
goal was to ascertain the effect of meditation on these response variables. This study 
is explored in Example 1 on page 86. 


Randomized Experiments versus Observational Studies


Ideally, if we were trying to ascertain the connection between the explanatory and 
response variables, we would keep everything constant except the explanatory vari- 
able. We would then manipulate the explanatory variable and notice what happened 
to the response variable as a consequence. We rarely reach this ideal, but we can 
come closer with an experiment than with an observational study. 

In a randomized experiment, we create differences in the explanatory variable 
and then examine the results. In an observational study we observe differences in 
the explanatory variable and then notice whether these are related to differences in 
the response variable. 

For example, suppose we wanted to detect the effects of the explanatory variable 
“smoking during pregnancy” on the response variable “child’s IQ at 4 years of age.”
In a randomized experiment, we would randomly assign half of the mothers to 
smoke during pregnancy and the other half to not smoke. In an observational study, 
we would merely record smoking behavior. This example demonstrates why we 
can’t always perform an experiment. 


Two reasons why we must sometimes use an observational study instead of 
an experiment: 
1. It is unethical or impossible to assign people to receive a specific treatment. 


2. Certain explanatory variables, such as handedness, are inherent traits and 
cannot be randomly assigned. 


Confounding Variables and Interacting Variables 
Confounding Variables 


A confounding variable is one that has two properties. First, a confounding
variable is related to the explanatory variable in the sense that individuals who
differ for the explanatory variable are also likely to differ for the confounding
variable. Second, a confounding variable affects the response variable. Because of
these two properties, the effect of a confounding variable on the response variable
cannot be separated from the effect of the explanatory variable on the response
variable.

For instance, suppose we are interested in the relationship between smoking dur- 
ing pregnancy and child’s subsequent IQ a few years after birth. The explanatory 
variable is whether or not the mother smoked during pregnancy, and the response 
variable is subsequent IQ of the child. But if we notice that women who smoke dur- 
ing pregnancy have children with lower IQs than the children of women who don’t 
smoke, it could be because women who smoke also have poor nutrition, or lower 
levels of education, or lower income. In that case, mother’s nutrition, education, and 
income would all be confounding variables. They are likely to differ for smokers and 
nonsmokers, and they are likely to affect the response, subsequent IQ of the child. 
The effect of these variables on the child’s IQ cannot be separated from the effect of 
smoking, which was the explanatory variable of interest. 

Confounding variables are a bigger problem in observational studies than in ex- 
periments. In fact, one of the major advantages of an experiment over an observa- 
tional study is that in an experiment, the researcher attempts to control for 
confounding variables. In an observational study, the best the researcher can hope to 
do is measure possible confounding variables and see if they are also related to the 
response variable. 
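
The way a confounding variable blurs a comparison can be sketched with a small simulation. Everything in it is a made-up assumption for illustration: the income split, the smoking rates, the IQ averages, and a "true" smoking effect of 2 points are hypothetical, not estimates from any study. The randomized branch is a thought experiment only, because assigning mothers to smoke would be unethical.

```python
import random
import statistics

TRUE_EFFECT = 2.0  # hypothetical IQ points lost because of smoking (made up)


def iq_gap(randomized, n=20000, seed=1):
    """Return mean IQ of nonsmokers' children minus mean IQ of smokers' children."""
    rng = random.Random(seed)
    smokers, nonsmokers = [], []
    for _ in range(n):
        low_income = rng.random() < 0.5  # hypothetical confounding variable
        if randomized:
            # Coin-flip assignment: smoking is unrelated to income.
            smokes = rng.random() < 0.5
        else:
            # Observational world: low-income mothers smoke more often.
            smokes = rng.random() < (0.6 if low_income else 0.2)
        # Income affects IQ on its own, whether or not the mother smokes.
        iq = rng.gauss(95.0 if low_income else 105.0, 15.0)
        if smokes:
            smokers.append(iq - TRUE_EFFECT)
        else:
            nonsmokers.append(iq)
    return statistics.mean(nonsmokers) - statistics.mean(smokers)


observational_gap = iq_gap(randomized=False)
randomized_gap = iq_gap(randomized=True)
print(f"Observed gap without randomization: {observational_gap:.1f} points")
print(f"Observed gap with randomization:    {randomized_gap:.1f} points")
```

Under these assumptions the observational gap mixes the smoking effect with the income effect, so it comes out noticeably larger than 2 points, whereas the randomized gap stays close to the 2 points built into the simulation.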


Interactions Between Variables 

When you read the results of a study, you should also be aware that there may be in- 
teractions between explanatory variables. An interaction occurs when the effect of one 
explanatory variable on the response variable depends on what’s happening with an- 
other explanatory variable. For example, if smoking during pregnancy reduces IQ when 
the mother does not exercise, but raises or does not influence IQ when the mother does 
exercise, then we would say that smoking interacts with exercise to produce an effect 
on IQ. Notice that if two variables do interact, it is important that the results be given 
separately for each combination. To simply say that smoking lowers IQ when, in fact, 
it only did so for those who didn’t exercise would be a misleading conclusion. 
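
A short sketch may make the idea of an interaction concrete. The four average IQ values below are invented purely for illustration; the point is only that the smoking effect is computed separately within each exercise group and the two answers differ.

```python
# Hypothetical average IQ for each combination of the two explanatory
# variables (made-up numbers, not results from any study).
mean_iq = {
    ("smoker", "no exercise"): 94,
    ("nonsmoker", "no exercise"): 100,
    ("smoker", "exercise"): 100,
    ("nonsmoker", "exercise"): 100,
}


def smoking_effect(exercise_level):
    """Difference in average IQ (smoker minus nonsmoker) within one exercise group."""
    return mean_iq[("smoker", exercise_level)] - mean_iq[("nonsmoker", exercise_level)]


effects = {level: smoking_effect(level) for level in ("no exercise", "exercise")}

# The variables interact if the smoking effect changes with the exercise level.
has_interaction = effects["no exercise"] != effects["exercise"]

for level, effect in effects.items():
    print(f"Smoking effect when mother does {level}: {effect} points")
print("Interaction present:", has_interaction)
```

Reporting the two effects separately, as the loop does, is the practice recommended above; averaging them into one number would hide the fact that the effect appears in only one group.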


Experimental Units, Subjects, and Volunteers 


In addition to humans, it is common for studies to be performed on plants, animals, 
machine parts, and so on. To have a generic term for this conglomeration of possi- 
bilities, we define experimental units to be the smallest basic objects to which we 
can assign different treatments in a randomized experiment and observational units 
to be the objects or people measured in any study. 

The terms participants or subjects are commonly used when the observational 
units are people. In most cases, the participants in studies are volunteers. Sometimes 
they are passive volunteers, such as when all patients treated at a particular medical 
facility are asked to sign a consent form agreeing to participate in a study. 

Often, researchers recruit volunteers through the newspaper. For example, a
weekly newspaper in a small town near a research university in California ran an
article with the headline, “Volunteers sought for silicone study” (Winters (CA)
Express, 16 December 1993, p. 8). The article explained that researchers at a local
medical school were “seeking 100 women with silicone breast implants and 100
without who are willing to fast overnight and then give a blood sample.” The article
also explained who was doing the research and its purpose.

Notice that when volunteers are recruited for a study, the results cannot
necessarily be extended to the larger population. For example, if volunteers are
enticed to participate by receiving a small payment or free medical care, as is
often the case, those who respond are more likely to be from lower socioeconomic
backgrounds. Common sense should enable you to figure out if this is likely to be
a problem, but researchers should always report the source of their participants
so you can judge this for yourself.


5.2 Designing a Good Experiment 


Designing a flawless experiment is extremely difficult, and carrying one out is prob- 
ably impossible. Nonetheless, there are ideals to strive for, and in this section, we in- 
vestigate those first. We then explore some of the pitfalls that are still quite prevalent 
in research today. 


Randomization: The Fundamental 
Feature of Experiments 


Experiments are supposed to reduce the effects of confounding variables and other 
sources of bias that are necessarily present in observational studies. They do so by 
using a simple principle called randomization. 

Randomization in experiments is related to the idea of random selection dis- 
cussed in Chapter 4, when we described how to choose a sample for a survey. There, 
we were concerned that everyone in the population had a specified probability of 
making it into the sample. In randomized experiments, we are concerned that each 
of the experimental units (people, animals, and so on) has a specified probability of 
receiving any of the potential treatments. For example, Salk should have ensured that 
each group of babies available for study had an equal chance of being assigned to 
hear the heartbeat or to go into the silent nursery. Otherwise, he could have chosen 
the babies who looked healthier to begin with to receive the heartbeat treatment. 

In statistics, “random” is not synonymous with “haphazard,” despite what your
thesaurus might say. Although random assignments may not be possible or ethical
under some circumstances, in situations where randomization is feasible, it is
usually not difficult to accomplish. It can be done easily with a table of random
digits, a computer, or even, if done carefully, by physical means such as flipping
a coin or drawing numbers from a hat. The important feature, ensured by proper
randomization, is that the chances of being assigned to each condition are the
same for each participant. Or, if the same participants are measured for all of
the treatments, then the order in which they are assigned should be chosen
randomly for each participant.


Randomly Assigning the Type of Treatments

In the most basic type of randomized experiment, each participant is assigned to
receive one treatment. The decision about which treatment each participant
receives should be made using randomization. In addition to preventing the experimenter
from selectively choosing the best units to receive the favored treatment, randomly 
assigning the treatments to the experimental units helps protect against hidden or un- 
known biases. For example, suppose that in the experiment in Case Study 1.2, ap- 
proximately the first 11,000 physicians who enrolled were given aspirin and the 
remaining physicians were given placebos. It could be that the healthier, more ener- 
getic physicians enrolled first, thus giving aspirin an unfair advantage. 
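
One way a computer could carry out such an assignment is sketched below. The participant labels, the group of 50, and the function name `randomly_assign` are assumptions made for illustration; this is not the procedure used in Case Study 1.2. The list is shuffled first, so enrollment order has no influence on who receives which treatment.

```python
import random


def randomly_assign(participants, treatments=("aspirin", "placebo"), seed=None):
    """Shuffle the participants, then deal them out to the treatments in turn."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)  # every ordering of participants is equally likely
    return {
        person: treatments[i % len(treatments)]
        for i, person in enumerate(shuffled)
    }


# Hypothetical participant IDs, listed in the order they enrolled.
participants = [f"P{i:02d}" for i in range(1, 51)]
assignment = randomly_assign(participants, seed=2024)

group_sizes = {
    t: sum(1 for v in assignment.values() if v == t) for t in ("aspirin", "placebo")
}
print(group_sizes)
```

Shuffling and then alternating keeps the group sizes equal, which a separate coin flip for each participant would not guarantee.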


Randomizing the Order of Treatments 

In some experiments, all treatments are applied to each unit. In that case, random- 
ization should be used to determine the order in which they are applied. For exam- 
ple, suppose an experiment is conducted to determine the extent to which drinking 
alcohol or smoking marijuana impairs driving ability. Because drivers are all so dif- 
ferent, it makes sense to test the same drivers under all three conditions (alcohol, 
marijuana, and sober) rather than using different drivers for each condition. But if 
everyone were tested under alcohol, then marijuana, then sober, by the time they 
were traveling the course for the second and third times, their performance would 
improve just from having learned something about the course. 

A better method would be to randomly assign some drivers to each of the possi- 
ble orderings so the learning effect would average out over the three treatments. No- 
tice that it is important that the assignments be made randomly. If we let the 
experimenter decide which drivers to assign to which ordering, or if we let the dri- 
vers decide, assignments could be made that would give an unfair advantage to one 
of the treatments. 
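
The random ordering described above might be sketched as follows. The 12 driver labels and the function name `assign_orders` are hypothetical. The sketch lists all six possible orderings of the three conditions and hands them out to the drivers in shuffled order, so each ordering is used equally often and neither the experimenter nor the drivers choose.

```python
import itertools
import random

CONDITIONS = ("alcohol", "marijuana", "sober")


def assign_orders(drivers, seed=None):
    """Give each driver one of the possible treatment orderings at random."""
    rng = random.Random(seed)
    orders = list(itertools.permutations(CONDITIONS))  # all 6 orderings
    shuffled = list(drivers)
    rng.shuffle(shuffled)  # chance, not the experimenter, decides who gets what
    return {driver: orders[i % len(orders)] for i, driver in enumerate(shuffled)}


# Hypothetical driver labels.
drivers = [f"driver_{i}" for i in range(1, 13)]
schedule = assign_orders(drivers, seed=7)

for driver in drivers:
    print(driver, "->", ", ".join(schedule[driver]))
```

With 12 drivers, each of the six orderings is used twice, so every condition is driven first, second, and third equally often and the learning effect is spread evenly across the treatments.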


EXAMPLE 1 Randomly Assigning Mindfulness Meditation 


In the study resulting in News Story 1 in the Appendix, the researchers were
interested in knowing if regular practice of meditation would enhance the immune
system. If they had allowed participants to choose whether or not to meditate (the
explanatory variable), there would have been confounding variables, like how
hectic participants’ daily schedules were, that may also have influenced the
immune system (the response variable). Therefore, as explained in the
corresponding Original Source 1 on the CD, they recruited volunteers who were
willing to be assigned to meditate or not. There were 41 volunteers and they were
randomly assigned to one of two conditions. The 25 participants randomly assigned
to the “treatment group” completed an 8-week program of meditation training and
practice. The 16 participants randomly assigned to the “control group” did not
receive this training during the study, but for reasons of fairness, were offered
the training when the study was completed. The researchers had decided in advance
to assign the volunteers to the two groups as closely as possible to a 3:2 ratio.
So, all of the volunteers had the same chance (25/41) of being assigned to receive
the meditation training. By using random assignment, possible confounding factors,
like daily stress, should have been similar for the two groups.


Control Groups, Placebos, and Blinding


Control Groups 

To determine whether a drug, heartbeat sound, meditation technique, and so on, has 
an effect, we need to know what would have happened to the response variable if the 
treatment had not been applied. To find that out, experimenters create control 
groups, which are handled identically to the treatment group(s) in all respects, ex- 
cept that they don’t receive the active treatment. 


Placebos 

A special kind of control group is usually used in studies of the effectiveness of 
drugs. A substantial body of research shows that people respond not only to active 
drugs but also to placebos. (For example, see News Story 9 in the Appendix: 
“Against depression, a sugar pill is hard to beat.”) A placebo looks like the real drug 
but has no active ingredients. Placebos can be amazingly effective; studies have 
shown that they can help up to 62% of headache sufferers, 58% of those suffering 
from seasickness, and 39% of those with postoperative wound pain. 

Because the placebo effect is so strong, drug research is conducted by randomly 
assigning half of the volunteers to receive the drug and the other half to receive a 
placebo, without telling them which they are receiving. The placebo looks just like 
the real thing, so the participants will not be able to distinguish between it and the 
actual drug and thus will not be influenced by belief biases. 


Blinding 

The patient isn’t the only one who can be affected by knowing whether he or she has 
received an active drug. If the researcher who is measuring the reaction of the pa- 
tients were to know which group was which, the researcher might take the measure- 
ments in a biased fashion. To avoid these biases, good experiments use double-blind 
procedures. A double-blind experiment is one in which neither the participant nor 
the researcher taking the measurements knows who had which treatment. A single- 
blind experiment is one in which only one of the two, the participant or the re- 
searcher taking the measurements, knows which treatment the participant was 
assigned. 

Although double-blind experiments are preferable, they are not always possible. 
For example, in testing the effect of daily meditation on blood pressure, the subjects 
would obviously know if they were in the meditation group or the control group. In 
this case, the experiment could only be single-blind, in which case the person taking 
the blood pressure measurement would not know who was in which group. 
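
A double-blind setup is often managed by giving each participant a neutral code and keeping the code-to-treatment key away from everyone who has contact with the participants. The sketch below illustrates that idea under stated assumptions: the participant labels, the four-digit codes, and the function name `make_blinded_labels` are all hypothetical and are not taken from any study described here.

```python
import random


def make_blinded_labels(participants, treatments=("drug", "placebo"), seed=None):
    """Return (labels, key): labels are safe to share, the key is kept secret."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)  # random assignment to treatments
    codes = rng.sample(range(1000, 10000), len(shuffled))  # unique 4-digit codes
    labels = {}  # participant -> code; seen by participants and measuring staff
    key = {}     # code -> treatment; held only by someone not taking measurements
    for i, (person, code) in enumerate(zip(shuffled, codes)):
        labels[person] = code
        key[code] = treatments[i % len(treatments)]
    return labels, key


# Hypothetical participant labels.
participants = [f"volunteer_{i}" for i in range(1, 11)]
labels, key = make_blinded_labels(participants, seed=11)

# The shared labels carry no information about the treatment.
print(labels)
```

Only after all measurements are recorded would the key be used to match each code to its treatment. In a single-blind study such as the meditation example, the same labels could still keep the person taking measurements unaware of the groups.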


EXAMPLE 2 Blindly Lowering Cholesterol


In the study described in News Story 3 in the Appendix, the researchers wanted to
compare a special dietary regime with a drug known to lower cholesterol
(lovastatin) to see which one would lower cholesterol more. The special dietary
regime, called the dietary “portfolio,” included elements thought to lower
cholesterol, such as soy protein and almonds. The lovastatin group was asked to
eat a very low-fat diet in addition to taking the drug, so a “control group” was
included that received the same low-fat diet as those taking the lovastatin but
was administered a placebo. Thus, there were three treatments: the “portfolio
diet,” the low-fat diet with lovastatin, and the low-fat diet with placebo.

There were 46 volunteers for the study. Here is a description from the article Origi- 
nal Source 3 on the CD, illustrating how the researchers addressed random assignment 
and blinding: 


Participants were randomized by the statistician using a random number generator 
.... The statistician held the code for the placebo and statin tablets provided with 
the control and statin diets, respectively. This aspect of the study was therefore 
double-blind. The dieticians were not blinded to the diet because they were respon- 
sible for patients’ diets and for checking diet records. The laboratory staff responsi- 
ble for analyses were blinded to treatment and received samples labeled with name 
codes and dates. (p. 504) 


In other words, the researchers and participants were both blind as to which drug (lo- 
vastatin or placebo) people in those two groups were taking, but the participants and 
dieticians could not be blind to what the participants were eating. The staff who evaluated
cholesterol measurements, however, could be and were blind to the treatment.
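
The random assignment and blinding described in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the four-digit codes, and the near-equal group sizes are hypothetical, and the article says nothing about the statistician's actual procedure beyond the use of a random number generator.

```python
import random
from collections import Counter

TREATMENTS = ["portfolio diet", "low-fat diet + lovastatin", "low-fat diet + placebo"]

def assign_treatments(participant_ids, treatments, seed=None):
    """Shuffle participants, then deal them round-robin into treatment groups."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: treatments[i % len(treatments)] for i, pid in enumerate(ids)}

def blind(assignment, seed=None):
    """Replace treatment names with opaque codes (a hypothetical scheme)."""
    rng = random.Random(seed)
    codes = rng.sample(range(1000, 10000), len(assignment))  # unique 4-digit codes
    labels = {}  # what the measurement staff see: participant -> code
    key = {}     # held only by the statistician: code -> treatment
    for (pid, treatment), code in zip(assignment.items(), codes):
        labels[pid] = code
        key[code] = treatment
    return labels, key

assignment = assign_treatments(range(1, 47), TREATMENTS, seed=1)  # 46 volunteers
labels, key = blind(assignment, seed=2)
group_sizes = Counter(assignment.values())
print(group_sizes)
```

Anyone who sees only `labels` cannot tell which treatment a participant received; the link is recoverable only through `key`, which plays the role of the code held by the statistician.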


Matched Pairs, Blocks, and Repeated Measures 


It is sometimes easier and more efficient to have each person in a study serve as his 
or her own control. That way, natural variability in the response variable across in- 
dividuals doesn’t obscure the treatment effects. We encountered this idea when we 
discussed how to compare driving ability when under the influence of alcohol and 
marijuana and when sober. 

Sometimes, instead of using the same individual for the treatments, researchers 
will match people on traits that are likely to be related to the outcome, such as age, 
IQ, or weight. They then randomly assign each of the treatments to one member of 
each matched pair or grouping. For example, in a study comparing chemotherapy to 
surgery to treat cancer, patients might be matched by sex, age, and level of severity 
of the illness. One from each pair would then be randomly chosen to receive the 
chemotherapy and the other to receive surgery. (Of course, such a study would only 
be ethically feasible if there were no prior knowledge that one treatment was supe- 
rior to the other. Patients in such cases are always required to sign an informed 
consent.) 


Matched-Pair Designs 

Experimental designs that use either two matched individuals or the same individual 
to receive each of two treatments are called matched-pair designs. For instance, to 
measure the effect of drinking caffeine on performance on an IQ test, researchers 
could use the same individuals twice, or they could pair individuals based on initial 
IQ. If the same people were used, they might drink a caffeinated beverage in one test 
session, followed by an IQ test, and a noncaffeinated beverage in another test ses- 
sion, followed by an IQ test (with different questions). The order in which the two 
sessions occurred would be decided randomly, separately for each participant. That 
would eliminate biases, such as learning how to take an IQ test, that would systematically
favor one session (first or second) over the other.

If matched pairs of people with similar IQs were used, one person in each pair 
would be randomly chosen to drink the caffeine and the other to drink the noncaffeinated
beverage. The important feature of these designs is that randomization is used to assign
the two treatments: their order when the same person receives both, or which member of
each pair receives which. Of course, it is still important to try
to conduct the experiment in a double-blind fashion so that neither the participant 
nor the researcher knows which order was used. In the caffeine and IQ example, nei- 
ther the participants nor the person giving the IQ test would be told which sessions 
included the caffeine. 
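
As a minimal sketch of the two kinds of randomization just described, the Python code below randomizes the session order separately for each participant and, for matched pairs, randomly picks which member of each pair gets the caffeine. The function names and participant labels are hypothetical, introduced only for illustration.

```python
import random

def randomize_session_order(participants, seed=None):
    """Same person receives both treatments: randomize the order per person."""
    rng = random.Random(seed)
    orders = {}
    for person in participants:
        order = ["caffeine", "no caffeine"]
        rng.shuffle(order)
        orders[person] = tuple(order)
    return orders

def randomize_within_pairs(pairs, seed=None):
    """Matched pairs: randomly choose which member of each pair gets caffeine."""
    rng = random.Random(seed)
    allocation = {}
    for member_a, member_b in pairs:
        first, second = rng.sample([member_a, member_b], 2)
        allocation[first] = "caffeine"
        allocation[second] = "no caffeine"
    return allocation

orders = randomize_session_order(["p1", "p2", "p3", "p4"], seed=0)
pairs = [("a1", "a2"), ("b1", "b2"), ("c1", "c2")]
allocation = randomize_within_pairs(pairs, seed=0)
print(orders)
print(allocation)
```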


Randomized Block Designs and Repeated Measures 

An extension of the matched-pair design to three or more treatments is called a ran- 
domized block design, or sometimes simply a block design. The method described 
for comparing drivers under three conditions was a randomized block design. Each 
driver is called a block. This somewhat peculiar terminology results from the fact 
that these ideas were first used in agricultural experiments, where the experimental 
units were plots of land that had been subdivided into “blocks.” In the social sci- 
ences, designs such as these, in which the same participants are measured repeatedly, 
are referred to as repeated-measures designs. 


Reducing and Controlling Natural Variability and Systematic Bias 


Both natural variability and systematic bias can mask differences in the re- 
sponse variable that are due to differences in the explanatory variable. Here 
are some solutions: 


1. Random assignment to treatments is used to reduce unknown system- 
atic biases due to confounding variables that might otherwise exist be- 
tween treatment groups. 

2. Matched pairs, repeated measures, and blocks are used to reduce
known sources of natural variability in the response variable, so that differences
due to the explanatory variable can be detected more easily.
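
To see why using each person as his or her own control helps, consider the small sketch below. The scores are hypothetical numbers made up for illustration; they are not data from any study discussed in this chapter.

```python
from statistics import mean, stdev

# Hypothetical scores for five people, each measured under both conditions.
control = [50, 70, 90, 60, 80]
treated = [46, 64, 86, 55, 74]

# Natural variability across people: large compared with the treatment effect.
spread_across_people = stdev(control)

# Differences within each person remove the person-to-person variability.
differences = [t - c for t, c in zip(treated, control)]
mean_difference = mean(differences)
spread_of_differences = stdev(differences)

print(round(spread_across_people, 1), mean_difference, spread_of_differences)
```

The scores vary from person to person by about 16 points, which would easily hide an average drop of 5 points if two separate groups were compared. The within-person differences vary by only 1 point, so the same 5-point drop stands out clearly.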


CASE STUDY 5.1 Quitting Smoking with Nicotine Patches
Source: Hurt et al. (1994), pp. 595-600. 


There is no longer any doubt that smoking cigarettes is hazardous to your health and 
to those around you. Yet, for someone addicted to smoking, quitting is no simple 
matter. One promising technique for helping people to quit smoking is to apply a 
patch to the skin that dispenses nicotine into the blood. These “nicotine patches” 
have become one of the most frequently prescribed medications in the United States. 
To test the effectiveness of these patches on the cessation of smoking, Dr. Richard 
Hurt and his colleagues recruited 240 smokers at Mayo Clinics in Rochester, Minnesota;
Jacksonville, Florida; and Scottsdale, Arizona. Volunteers were required to be
between the ages of 20 and 65, have an expired carbon monoxide level of 10 ppm or 
greater (showing that they were indeed smokers), be in good health, have a history of 
smoking at least 20 cigarettes per day for the past year, and be motivated to quit. 

Volunteers were randomly assigned to receive either 22-mg nicotine patches or 
placebo patches for 8 weeks. They were also provided with an intervention program 
recommended by the National Cancer Institute, in which they received counseling 
before, during, and for many months after the 8-week period of wearing the patches. 

After the 8-week period of patch use, almost half (46%) of the nicotine group 
had quit smoking, whereas only one-fifth (20%) of the placebo group had. Having 
quit was defined as “self-reported abstinence (not even a puff) since the last visit and 
an expired air carbon monoxide level of 8 ppm or less” (p. 596). After a year, rates 
in both groups had declined, but the group that had received the nicotine patch still 
had a higher percentage who had successfully quit than did the placebo group: 
27.5% versus 14.2%. 
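
The quit rates above can be compared in two simple ways: as a difference and as a ratio. The sketch below does both. It assumes the 240 volunteers were split evenly into two groups of 120, which is used here only to turn the percentages into approximate head counts; the names in the code are hypothetical.

```python
GROUP_SIZE = 120  # assumption: 240 volunteers split evenly between two groups

quit_rates = {
    "8 weeks": {"nicotine": 0.46, "placebo": 0.20},
    "1 year": {"nicotine": 0.275, "placebo": 0.142},
}

def compare(rates):
    """Return (difference, ratio) of the nicotine and placebo quit rates."""
    difference = rates["nicotine"] - rates["placebo"]
    ratio = rates["nicotine"] / rates["placebo"]
    return difference, ratio

for period, rates in quit_rates.items():
    difference, ratio = compare(rates)
    approx_counts = {group: round(rate * GROUP_SIZE) for group, rate in rates.items()}
    print(f"{period}: difference = {difference:.3f}, ratio = {ratio:.2f}, "
          f"approximate quitters = {approx_counts}")
```

At both times the nicotine group's quit rate is roughly double that of the placebo group, even though the rates themselves fell over the year.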

The study was double-blind, so neither the participants nor the nurses taking the 
measurements knew who had received the active nicotine patches. The study was 
funded by a grant from Lederle Laboratories and was published in the Journal of the 
American Medical Association. 


5.3 Difficulties and Disasters in Experiments 


We have already introduced some of the problems that can be encountered with ex- 
periments, such as biases introduced by lack of randomization. However, many of 
the complications that result from poorly conducted experiments can be negated 
with proper planning and execution. 


Here are some potential complications:

1. Confounding variables
2. Interacting variables
3. Placebo, Hawthorne, and experimenter effects
4. Ecological validity and generalizability


Confounding Variables 
The Problem 


Variables that are connected with the explanatory variable can distort the results of 
an experiment because they—and not the explanatory variable—may be the agent 
actually causing a change in the response variable. 


The Solution

Randomization is the solution. If experimental units are randomly assigned to treat- 
ments, then the effects of the confounding variables should apply equally to each 
treatment. Thus, observed differences between treatments should not be attributable 
to the confounding variables. 


EXAMPLE 3 Nicotine Patch Therapy

The nicotine patch therapy in Case Study 5.1 was more effective when there were no 
other smokers in the participant's home. Suppose the researchers had assigned the first 
120 volunteers to the placebo group and the last 120 to the nicotine group. Further, 
suppose that those with no other smokers at home were more eager to volunteer. Then 
the treatment would have been confounded with whether there were other smokers at 
home. The observed results showing that the active patches were more effective than 
the placebo patches could have merely represented a difference between those with 
other smokers at home and those without. By using randomization, approximately equal 
numbers in each group should have come from homes with other smokers. Thus, any 
impact of that variable would be spread equally across the two groups.
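
A small simulation makes the point concrete. The list of volunteers below is hypothetical: it assumes, purely for illustration, that most early volunteers had no other smokers at home and most late volunteers did. The proportions are invented and are not taken from the study.

```python
import random

# Hypothetical volunteers in sign-up order; True means other smokers at home.
volunteers = [False] * 90 + [True] * 30 + [False] * 30 + [True] * 90  # 240 people

def share_with_other_smokers(group):
    return sum(group) / len(group)

# Assignment by sign-up order: first 120 to placebo, last 120 to nicotine.
placebo_by_order, nicotine_by_order = volunteers[:120], volunteers[120:]
gap_by_order = abs(share_with_other_smokers(nicotine_by_order)
                   - share_with_other_smokers(placebo_by_order))

# Random assignment: shuffle first, then split into two groups of 120.
rng = random.Random(5)
shuffled = volunteers[:]
rng.shuffle(shuffled)
placebo_random, nicotine_random = shuffled[:120], shuffled[120:]
gap_random = abs(share_with_other_smokers(nicotine_random)
                 - share_with_other_smokers(placebo_random))

print(gap_by_order, round(gap_random, 3))
```

With assignment by sign-up order, the two groups differ by 50 percentage points in the share with other smokers at home. After shuffling, the gap is typically only a few percentage points, so the variable can no longer masquerade as a treatment effect.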


Interacting Variables 
The Problem 


Sometimes a second variable interacts with the explanatory variable, but the results 
are reported without taking that interaction into account. The reader is then misled 
into thinking the treatment works equally well, no matter what the condition is for 
the second variable. 


The Solution 
Researchers should measure and report variables that may interact with the main ex- 
planatory variable(s). 


EXAMPLE 4 Other Smokers at Home

In the experiment described in Case Study 5.1, there was an interaction between the 
treatment and whether there were other smokers at home. The researchers measured 
and reported this interaction. After the 8-week patch therapy, the proportion of the 
nicotine group who had quit smoking was only 31% if there were other smokers at 
home, whereas it was 58% if there were not. In the placebo group, the proportions who 
had quit were the same whether there were other smokers at home or not. Therefore, 
it would be misleading to merely report that 46% of the nicotine recipients had quit, 
without also providing the information about the interaction.
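
The overall 46% is a weighted average of the 31% and 58% figures, with weights given by the share of the nicotine group in each kind of household. The sketch below works backward to the share those three numbers imply. Because the reported percentages are rounded, the result is only approximate, and the function name is hypothetical.

```python
def implied_share(overall, rate_a, rate_b):
    """Share w of subgroup A such that w*rate_a + (1 - w)*rate_b equals overall."""
    return (rate_b - overall) / (rate_b - rate_a)

# Subgroup A: other smokers at home (31% quit); subgroup B: none (58% quit).
share_with_smokers = implied_share(overall=0.46, rate_a=0.31, rate_b=0.58)
rebuilt_overall = share_with_smokers * 0.31 + (1 - share_with_smokers) * 0.58

print(round(share_with_smokers, 3), round(rebuilt_overall, 2))
```

Under these figures, roughly 44% of the nicotine group lived with other smokers. A group with a different mix of households would show a different overall quit rate, which is why the single figure of 46% can mislead.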


Placebo, Hawthorne, and Experimenter Effects 
The Problem 


We have already discussed the strong effect that a placebo can have on experimen- 
tal outcomes because the power of suggestion is somehow able to affect the result. 
A related idea is that participants in an experiment respond differently than they oth- 
erwise would, just because they are in the experiment. This is called the “Hawthorne 
effect” because it was first detected in 1924 during a study of factory workers at the 
Hawthorne, Illinois, plant of the Western Electric Company. (The phrase actually 
was not coined until much later; see French, 1953.) Related to these effects are nu- 
merous ways in which the experimenter can bias the results. These include record- 
ing the data erroneously to match the desired outcome, treating subjects differently 
based on which condition they are receiving, or subtly letting the subjects know the 
desired outcome. 


The Solution 

As we have seen, most of these problems can be overcome by using double-blind de- 
signs and by including a placebo group or a control group that receives identical han- 
dling except for the active part of the treatment. Other problems, such as incorrect 
data recording, should be addressed by having data entered automatically into a 
computer as it is collected, if possible. Depending on the experiment, there may still 
be subtle ways in which experimenter effects can sneak into the results. You should 
be aware of these possibilities when you read the results of a study. 


EXAMPLE 5 Dull Rats

In an experiment designed to test whether the expectations of the experimenter could 
really influence the results, Rosenthal and Fode (1963) deliberately conned 12 experi- 
menters. They gave each one five rats that had been taught to run a maze. They told six 
of the experimenters that the rats had been bred to do well (that is, that they were 
“maze bright”) and told the other six that their rats were “maze dull” and should not 
be expected to do well. Sure enough, the experimenters who had been told they had 
bright rats found learning rates far superior to those found by the experimenters who 
had been told they had dull rats. Hundreds of other studies have since confirmed the 
“experimenter effect.”


Ecological Validity and Generalizability 
The Problem 


Suppose you wanted to compare three assertiveness training methods to see which 
was more effective in teaching people how to say no to unwanted requests on their 
time. Would it be realistic to give them the training, then measure the results by ask- 
ing them to role-play in situations in which they would have to say no? Probably not, 
because everyone involved would know it was only role-playing. The usual social 
pressures to say yes would not be as striking. This is an example of an experiment with 
little ecological validity. In other words, the variables have been removed from their 
natural setting and are measured in the laboratory or in some other artificial setting. 
Thus, the results do not accurately reflect the impact of the variables in the real world 
or in everyday life. A related problem is one we have already mentioned—namely, if 
volunteers are used for a study, can the results be generalized to any larger group? 


The Solution 

There are no ideal solutions to these problems, other than trying to design experi- 
ments that can be performed in a natural setting with a random sample from the 
population of interest. In most experimental work, these idealistic solutions are impossible.
A partial solution is to measure variables for which the volunteers might
differ from the general population, such as income, age, or health, and then try to de- 
termine the extent to which those variables would make the results less general than 
desired. In any case, when you read the results of a study, you should question its 
ecological validity and its generalizability. 


EXAMPLE 6 Real Smokers with a Desire to Quit 
The researchers in Case Study 5.1 did many things to help ensure ecological validity and 
generalizability. First, they used a standard intervention program available from and rec- 
ommended by the National Cancer Institute instead of inventing their own, so that 
other physicians could follow the same program. Next, they used participants at three 
different locations around the country, rather than in one community only, and they in- 
volved a wide range of ages (20 to 65). They included individuals who lived in house- 
holds with other smokers as well as those who did not. Finally, they recorded numerous 
other variables (sex, race, education, marital status, psychological health, and so on) 
and checked to make sure these were not related to the response variable or the patch 
assignment.


CASE STUDY 5.2 Exercise Yourself to Sleep 
Source: King et al. (1997), pp. 32-37. 


According to the UC Davis Health Journal (November–December 1997, p. 8), older
adults constitute only 12% of the population but receive almost 40% of the sedatives 
prescribed. The purpose of this randomized experiment was to see if regular exercise 
could help reduce sleep difficulties in older adults. The 43 participants were sedentary 
volunteers between the ages of 50 and 76 with moderate sleep problems but no 
heart disease. They were randomly assigned either to participate in a moderate 
community-based exercise program four times a week for 16 weeks or continue to be 
sedentary. For ethical reasons, the control group was admitted to the program when 
the experiment was complete. The results were striking. The exercise group fell asleep 
an average of 11 minutes faster and slept an average of 42 minutes longer than the 
control group. Note that this could not be a double-blind experiment because partici- 
pants obviously knew whether they were exercising. Because sleep patterns were self- 
reported, there could have been a tendency to err in reporting, in the direction desired 
by the experimenters. However, this is an example of a well-designed experiment, 
given the practical constraints, and, as the authors conclude, it does allow the finding 
that “older adults with moderate sleep complaints can improve self-rated sleep qual- 
ity by initiating a regular, moderate-intensity exercise program” (p. 32). 


5.4 Designing a Good Observational Study 


In trying to establish causal links, observational studies start with a distinct disad- 
vantage compared to experiments: The researchers observe, but cannot control, the 
explanatory variables. However, these researchers do have the advantage that they 
are more likely to measure participants in their natural setting. Before looking at 
some complications that can arise, let’s look at an example of a well-designed ob- 
servational study. 


CASE STUDY 5.3 Baldness and Heart Attacks
Source: Lesko, Rosenberg, and Shapiro (1993), pp. 998-1003. 


On March 8, 1993, Newsweek announced, “A really bad hair day: Researchers link 
baldness and heart attacks” (p. 62). The article reported that “men with typical male 
pattern baldness ... are anywhere from 30 to 300 percent more likely to suffer a 
heart attack than men with little or no hair loss at all.” Pattern baldness is the type
affecting the crown or vertex and is not the same as a receding hairline; it affects ap- 
proximately one-third of middle-aged men. 

The report was based on an observational study conducted by researchers at 
Boston University School of Medicine, in which they compared 665 men who had 
been admitted to hospitals with their first heart attack to 772 men in the same age 
group (21 to 54 years old) who had been admitted to the same hospitals for other
reasons. Thirty-five hospitals were involved, all in eastern Massachusetts and Rhode 
Island. 

The study found that the percentage of men who showed some degree of pattern 
baldness was substantially higher for those who had had a heart attack (42%) than 
for those who had not (34%). Further, when they used sophisticated statistical tests 
to ask the question in the reverse direction, they found an increased risk of heart at- 
tack for men with any degree of pattern baldness. The analysis methods included ad- 
justments for age and other heart attack risk factors. The increase in risk was more 
severe with increasing severity of baldness, after adjusting for age and other risk 
factors. 
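
The text does not say which statistical methods the researchers used. One standard summary for case-control data is the odds ratio, which has the same value whether you compare the odds of baldness for cases and controls or the odds of being a case for bald and non-bald men. The sketch below computes a crude (unadjusted) odds ratio from the two percentages above. It ignores the adjustments for age and other risk factors, so it should not be expected to match the study's published estimates.

```python
def odds(proportion):
    """Convert a proportion into odds, e.g. 0.5 -> 1.0."""
    return proportion / (1 - proportion)

def odds_ratio(p_cases, p_controls):
    """Odds of the attribute among cases divided by the odds among controls."""
    return odds(p_cases) / odds(p_controls)

# 42% of heart attack patients (cases) and 34% of other patients (controls)
# showed some degree of pattern baldness.
crude_odds_ratio = odds_ratio(0.42, 0.34)
print(round(crude_odds_ratio, 2))
```

An odds ratio above 1 indicates that, in this sample, pattern baldness was more common among the men who had had a heart attack.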

The authors of the study speculated that there may be a third variable, perhaps a 
male hormone, that both increases the risk of heart attacks and leads to a propensity 
for baldness. With an observational study such as this, scientists can establish a con- 
nection, and they can then look for causal mechanisms in future work. 


Types of Observational Studies 


Case-Control Studies 

Some terms are used specifically for observational studies. Case Study 5.3 is an ex- 
ample of a case-control study. In such a study, “cases” who have a particular at- 
tribute or condition are compared with “controls” who do not. In this example, those 
who had been admitted to the hospital with a heart attack were the cases, and those 
who had been admitted for other reasons were the controls. The cases and controls 
are compared to see how they differ on the variable of interest, which in Case Study 
5.3 was the degree of baldness. 

Sometimes cases are matched with controls on an individual basis. This type of 
design is similar to a matched-pair experimental design. The analysis proceeds by 
first comparing the members of each pair, then summarizing over all pairs. Unlike in a
matched-pair experiment, the researcher does not randomly assign treatments within pairs
but is restricted to how they occur naturally. For example, to identify whether left-handed
people die at a younger age, researchers might match each left-handed case with a 
right-handed sibling as a control and compare their ages at death. Handedness could 
obviously not be randomly assigned to the two individuals, so confounding factors 
might be responsible for any observed differences. 


Retrospective or Prospective Studies 

Observational studies are also classified according to whether they are retrospec- 
tive, in which participants are asked to recall past events, or prospective, in which 
participants are followed into the future and events are recorded. The latter is a bet- 
ter procedure because people often do not remember past events accurately. 


Advantages of Case-Control Studies 


Case-control studies have become increasingly popular in medical research, and 
with good reason. Much more efficient than experiments, they do not suffer from the 
ethical considerations inherent in the random assignment of potentially harmful or 
beneficial treatments. The purpose of a case-control study is to find out whether one 
or more explanatory variables are related to a certain disease. For instance, in an ex- 
ample given later in this book, researchers were interested in whether owning a pet 
bird is related to incidence of lung cancer. 

A case-control study begins with the identification of a suitable number of cases, 
or people who have been diagnosed with the disease of interest. Researchers then 
identify a group of controls, who are as similar as possible to the cases, except that 
they don’t have the disease. To achieve this similarity, researchers often use patients 
hospitalized for other causes as the controls. For instance, in determining whether 
owning a pet bird is related to incidence of lung cancer, researchers would identify 
lung cancer patients as the cases and then find people with similar backgrounds who 
do not have lung cancer as the controls. They would then compare the proportions 
of cases and controls who had owned a pet bird. 


Efficiency 

The case-control design has some clear advantages over randomized experiments as 
well as over other observational studies. Case-control studies are very efficient in 
terms of time, money, and inclusion of enough people with the disease. Imagine try- 
ing to design an experiment to find out whether a relationship exists between own- 
ing a bird and getting lung cancer. You would randomly assign people to either own 
a bird or not and then wait to see how many in each group contracted lung cancer. 
The problem is that you would have to wait a long time, and even then, you would 
have very few cases of lung cancer in either group. In the end, you may not have 
enough cases for a valid comparison. 

A case-control study, in contrast, would identify a large group of people who had 
just been diagnosed with lung cancer and would then ask them whether they had 
owned a pet bird. A similar control group would be identified and asked the same 
question. A comparison would then be made between the proportion of cases (lung 
cancer patients) who had birds and the proportion of controls who had birds. 
Reducing Potential Confounding Variables

Another advantage of case-control studies over other observational studies is that the 
controls are chosen to try to reduce potential confounding variables. For example, in 
Case Study 5.3, suppose it were true that bald men were simply less healthy than 
other men and were therefore more likely to get sick in some way. An observational 
study that recorded only whether someone had baldness and whether they had had a 
heart attack would not be able to control for that fact. By using other hospitalized pa- 
tients as the controls, the researchers were able to at least partially account for gen- 
eral health as a potential confounding factor. 

You can see that careful thought is needed to choose controls that reduce poten- 
tial confounding factors and do not introduce new ones. For example, suppose we 
wanted to know if heavy exercise induced heart attacks and, as cases, we used peo- 
ple as they were admitted to the hospital with a heart attack. We would certainly not 
want to use other newly admitted patients as controls. People who were sick enough 
to enter the hospital (for anything other than sudden emergencies) would probably 
not have been recently engaging in heavy exercise. When you read the results of a 
case-control study, you should pay attention to how the controls were selected. 


5.5 Difficulties and Disasters in Observational Studies 


As with other types of research we have discussed, when you read the results of ob- 
servational studies you need to watch for procedures that could negate the results of 
the study. 


Here are some complications that can arise:
1. Confounding variables and the implications of causation
2. Extending the results inappropriately
3. Using the past as a source of data


Confounding Variables and the Implications 
of Causation 


The Problem 

Don’t be fooled into thinking that a link between two variables established by an ob- 
servational study implies that one causes the other. There is simply no way to sepa- 
rate out all potential confounding factors if randomization has not been used. 


The Solution 
A partial solution is achieved if researchers measure all the potential confounding 
variables they can imagine and include those in the analysis to see whether they are 
related to the response variable. Another partial solution can be achieved in case-control
studies by choosing the controls to be as similar as possible to the cases. The
other part of the solution is up to the reader: Don’t be fooled into thinking a causal 
relationship exists. 

There are some guidelines that can be used to assess whether a collection of ob- 
servational studies indicates a causal relationship between two variables. These 
guidelines are discussed in Chapter 11. 


EXAMPLE 7 Smoking During Pregnancy

In Chapter 1, we introduced a study showing that women who smoked during preg- 
nancy had children whose IQs at age 4 were lower than those of similar women who 
had not smoked. The difference was as high as 9 points before accounting for con- 
founding variables, such as diet and education, but was reduced to just over 4 points 
after accounting for those factors. However, other confounding variables could exist 
that are different for mothers who smoke and that were not measured and analyzed, 
such as amount of exercise the mother got during pregnancy. Therefore, we should 
not conclude that smoking during pregnancy necessarily caused the children to have 
lower IQs.


Extending the Results Inappropriately 
The Problem 


Many observational studies use convenience samples, which generally are not rep- 
resentative of any population. Results should be considered with that in mind. Case- 
control studies often use only hospitalized patients, for example. 

In general, results of a study can be extended to a larger population only if the 
sample is representative of that population for the variables studied. For instance, in 
News Story 2 in the Appendix, “Research shows women hit harder by hangovers,” 
the research was based on a sample of students in introductory psychology classes 
at a university in the midwestern United States. The study compared drinking be- 
havior and hangover symptoms in men and women. To whom do you think the re- 
sults can be extended? It probably isn’t reasonable to think they can be extended to 
all adults because most of the students were not even of legal drinking age. But what 
about all people in the same age group as the students studied? All college students? 
College students in the Midwest? Only psychology students? As a reader, you must 
decide the extent to which you think this sample represents each of these populations 
on the question of alcohol consumption and hangover symptoms and their differ- 
ences for men and women. 


The Solution 

If possible, researchers should use an entire segment of the population of interest 
rather than just a convenience sample. In studying the relationship between smoking
during pregnancy and child’s IQ, described in Example 7, the researchers included 
most of the women in a particular county in upstate New York who were pregnant 
during the right time period. Had they relied solely on volunteers recruited through 
the media, their results would not be as extendable. 
EXAMPLE 8 Baldness and Heart Attacks


The observational study relating baldness and heart attacks only used men who were 
hospitalized for some reason. Although that may make sense in terms of providing a 
more similar control group, you should consider whether the results should be extended 
to all men.


Using the Past as a Source of Data 
The Problem 


Retrospective observational studies can be particularly unreliable because they ask 
people to recall past behavior. Some medical studies, in which the response variable 
is whether someone has died, can be even worse because they rely on the memories 
of relatives and friends rather than on the actual participants. Retrospective studies 
also suffer from the fact that variables that confounded things in the past may no 
longer be similar to those that would currently be confounding variables, and re- 
searchers may not think to measure them. Example 9 illustrates this problem. 


The Solution 

If at all possible, prospective studies should be used. That’s not always possible. For 
example, researchers who first considered the potential causes of AIDS or Toxic 
Shock Syndrome had to start with those who were afflicted and try to find common 
factors from their pasts. If possible, retrospective studies should use authoritative 
sources such as medical records rather than relying on memory. 


EXAMPLE 9 Do Left-Handers Die Young?


A few years ago, a highly publicized study pronounced that left-handed people did not 
live as long as right-handed people (Coren and Halpern, 1991). In one part of the study, 
the researchers had sent letters to next of kin for a random sample of recently deceased 
individuals, asking which hand the deceased had used for writing, drawing, and throw- 
ing a ball. They found that the average age of death for those who had been left- 
handed was 66, whereas for those who had been right-handed, it was 75. What they 
failed to take into account was that in the early part of the 20th century, many children 
were forced to write with their right hands, even if their natural inclination was to be 
left-handed. Therefore, people who died in their 70s and 80s during the time of this 
study were more likely to be right-handed than those who died in their 50s and 60s. The 
confounding factor of how long ago one learned to write was not taken into account. 
A better study would be a prospective one, following current left- and right-handers to 
see which group survived longer.
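
The arithmetic behind this kind of confounding can be shown with made-up numbers. In the sketch below, all counts and ages are hypothetical, and handedness has no effect at all on age at death within either group. The only difference is that fewer people in the earlier-born group ended up left-handed.

```python
# Hypothetical numbers, chosen only to illustrate the confounding.
# Each entry: (age at death, number of left-handers, number of right-handers).
deaths = [
    (80, 30, 970),   # born earlier: many natural left-handers were switched
    (60, 120, 880),  # born later: fewer were forced to switch
]

def mean_age_at_death(rows, column):
    """Average age at death, weighted by the head counts in the given column."""
    total_people = sum(row[column] for row in rows)
    total_years = sum(row[0] * row[column] for row in rows)
    return total_years / total_people

left_mean = mean_age_at_death(deaths, column=1)
right_mean = mean_age_at_death(deaths, column=2)
print(left_mean, round(right_mean, 1))
```

Even though handedness plays no role in these invented data, the left-handers appear to die about six years younger on average, simply because most of them come from the later-born group.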


5.6 Random Sample versus Random Assignment 


While random sampling and random assignment to treatments are related ideas, the 
conclusions that can be made based on each of them are very different. Random 
sampling is used to get a representative sample from the population of interest. Ran- 
dom assignment is used to control for confounding variables and other possible 
sources of bias. An ideal study would use both, but for practical reasons, that is 
rarely done. 


Extending Results to a Larger Population: 
Random Sampling 


The main purpose of using a random sample in a study is that the results can be ex- 
tended to the population from which the sample was drawn. As we will learn in later 
chapters, there is still some degree of uncertainty, similar to the margin of error in- 
troduced in Chapter 4, but it can be stated explicitly. Therefore, it would be ideal to 
use a random sample from the population of interest in all statistical studies. 
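
As a rough illustration of why a random sample lets the uncertainty be stated explicitly, the sketch below draws a simple random sample from a made-up population and reports an estimate with a margin of error. The population, the function name, and the sample size are hypothetical, and the margin of error uses the common conservative approximation of 1 divided by the square root of the sample size, which is assumed here to match the version introduced in Chapter 4.

```python
import math
import random


def estimate_proportion(population, n, seed=None):
    """Draw a simple random sample of size n and estimate a population proportion."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)   # every subset of size n is equally likely
    p_hat = sum(sample) / n
    margin = 1 / math.sqrt(n)            # conservative rule-of-thumb margin of error
    return p_hat, margin


# Hypothetical population: 100,000 adults, 30% of whom have some trait (coded 1).
population = [1] * 30_000 + [0] * 70_000

if __name__ == "__main__":
    p_hat, margin = estimate_proportion(population, n=1_000, seed=7)
    print(f"estimate: {p_hat:.3f} plus or minus {margin:.3f}")
```

Nothing comparable can be computed for a group of volunteers, because there is no way to know how the chance of volunteering is related to the trait being measured.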

Unfortunately, it is almost always impractical to obtain a random sample to par- 
ticipate in a randomized experiment or to be measured in an observational study. For 
instance, in the study in Example 1, it would be impossible to obtain a random sam- 
ple of people interested in learning meditation from among all adults and then teach 
some of them how to do so. Instead, the researchers in that study used employees at 
one company and asked for volunteers. 

The extent to which results can be extended to a larger population when a random 
sample is not used depends on the extent to which the participants in the study are 
representative of a larger population for the variables being studied. For example, in 
the nicotine patch experiment described in Case Study 5.1, the participants were pa- 
tients who came to a clinic seeking help to quit smoking. They probably do represent 
the population of smokers with a desire to quit. Thus, even though volunteers were 
used for the study, the results can be extended to other smokers with a desire to quit. 

As another example, the experiment in Case Study 1.2, investigating the effect 
of taking aspirin on heart attack rates, used male physicians. Therefore, the results 
may not apply to women, or to men who have professions with very different 
amounts of physical activity than physicians. 

As a reader, you must determine the extent to which you think the participants in 
a study are representative of a larger population for the question of interest. That’s 
why it’s important to know the answer to Critical Component 3 in Chapter 2, “The 
individuals or objects studied and how they were selected.” 


Establishing Cause and Effect: Random Assignment 


The main purpose of random assignment of treatments, or of the order of treatments, 
is to even out confounding variables across treatments. By doing this, a cause-and- 
effect conclusion can be inferred that would not be possible in an observational 
study. With randomization to treatments, the range of values for confounding vari- 
ables should be similar for each of the treatment groups. For instance, in Case Study 
5.1, whether someone is a light or heavy smoker may influence their ability to quit 
smoking. By randomly assigning participants to wear a nicotine patch or a control 
patch, about the same proportion of heavy smokers should be in each patch-type 
group. In Case Study 5.2, caffeine consumption may influence older adults’ ability 
to fall asleep quickly. By randomly assigning them to the exercise program or not, 
about the same proportion of caffeine drinkers should be in each group. 
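
The balancing effect of random assignment can be checked with a short simulation. The sketch below assumes a hypothetical pool of 240 volunteers, 96 of them heavy smokers; these numbers are invented for illustration and are not taken from Case Study 5.1. It repeatedly splits the pool at random into a patch group and a control group and records how far apart the two groups are in their share of heavy smokers.

```python
import random


def randomly_assign(participants, rng):
    """Shuffle the participants and split them into two equal-sized groups."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def share_heavy(group):
    """Proportion of a group made up of heavy smokers (coded 1)."""
    return sum(group) / len(group)


def average_imbalance(participants, n_trials=1_000, seed=0):
    """Average absolute gap in heavy-smoker share between the two groups."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        patch, control = randomly_assign(participants, rng)
        total += abs(share_heavy(patch) - share_heavy(control))
    return total / n_trials


# Hypothetical pool: 240 volunteers, 96 heavy smokers (1) and 144 light smokers (0).
pool = [1] * 96 + [0] * 144

if __name__ == "__main__":
    print(f"average gap in heavy-smoker share: {average_imbalance(pool):.3f}")
```

The average gap is small relative to the 40% of heavy smokers in the pool, which is the sense in which randomization "evens out" a confounding variable. The same holds for variables the researchers never thought to measure, which is the main advantage over trying to balance groups by hand.
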

Without random assignment, naturally occurring confounding variables can result 
in an apparent relationship between the explanatory and response variables. For 
instance, in Example 1, if the participants had chosen to meditate or not, the two 
groups probably would have differed in other ways, such as diet, that may affect im- 
mune system functioning. If the participants had been assigned in a nonrandom way 
by the experimenters, they could have chosen those who looked most healthy to par- 
ticipate in the meditation program. Thus, without random assignment, it would not 
have been possible to conclude that meditation was responsible for the observed dif- 
ference in immune system functioning. 
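
To see how self-selection alone can create an apparent relationship, consider the following sketch. It assumes a hypothetical situation in which meditation has no effect at all on immune response, a healthy diet raises the response by 10 points, and people with a healthy diet are more likely to choose to meditate (70% versus 30%). All names and numbers are invented for illustration.

```python
import random
from statistics import mean


def apparent_effect(n=20_000, self_select=True, seed=0):
    """Difference in mean immune response, meditators minus nonmeditators."""
    rng = random.Random(seed)
    meditators, others = [], []
    for _ in range(n):
        healthy_diet = rng.random() < 0.5
        # Response depends on diet only; meditation has NO effect in this sketch.
        response = rng.gauss(100 + (10 if healthy_diet else 0), 15)
        if self_select:
            # Hypothetical: healthy eaters are more likely to choose meditation.
            p_meditate = 0.7 if healthy_diet else 0.3
        else:
            # Random assignment: a fair coin flip, unrelated to diet.
            p_meditate = 0.5
        (meditators if rng.random() < p_meditate else others).append(response)
    return mean(meditators) - mean(others)


if __name__ == "__main__":
    print(f"self-selected groups:     {apparent_effect(self_select=True):.2f}")
    print(f"randomly assigned groups: {apparent_effect(self_select=False):.2f}")
```

With self-selection, the meditators score noticeably higher even though meditation does nothing in this simulation; the difference is due entirely to diet. With random assignment, the difference is close to zero, as it should be.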

As a reader, it is important for you to think about Component 6 from Chapter 2, 
“Differences in the groups being compared, in addition to the factor of interest.” If 
random assignment was used, these differences should be minimized. If random as- 
signment was not used, you must assess the extent to which you think group differ- 
ences may explain any observed relationships. In Chapter 11 we will learn more 
about establishing cause and effect when randomization isn’t used. 


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. Explain why it may be preferable to conduct a randomized experiment rather 
than an observational study to determine the relationship between two vari- 
ables. Support your argument with an example concerning something of inter- 
est to you. 


*2. In each of the following examples, explain whether the experiment was double- 
blind, single-blind, or neither, and explain whether it was a matched-pair or 
block design or neither. 


*a. A utility company was interested in knowing if agricultural customers would 
use less electricity during peak hours if their rates were different during 
those hours. Customers were randomly assigned to continue to get standard 
rates or to receive the time-of-day rate structure. Special meters were at- 
tached that recorded usage during peak and off-peak hours, which the cus- 
tomers could read. The technician who read the meter did not know what rate 
structure each customer had. 


*b. To test the effects of drugs and alcohol on driving performance, 20 volun- 
teers were each asked to take a driving test under three conditions: sober, af- 
ter two drinks, and after smoking marijuana. The order in which they drove 
under each condition was randomized. An evaluator watched them drive on 
a test course and rated their accuracy on a scale from 1 to 10, without knowing 
which condition they were under each time. 


*c. To compare four brands of tires, one of each brand was randomly assigned 
to the four locations on each of 50 cars. These tires were specially manufac- 
tured without any labels identifying the brand. After the tires had been on the 
cars for 30,000 miles, the researchers removed them and measured the re- 
maining tread. 


3. Designate the explanatory variable and the response variable for each of the 
three studies in Exercise 2. 


4. Refer to Thought Question 5 at the beginning of this chapter. The headline was 
based on a study in which a representative sample of over 400,000 adults in the 
United States were asked a series of questions, including level of education and 
on how many of the past 30 days they felt physically and emotionally healthy. 


a. What were the intended explanatory variable and response variable for this 
study? 


b. Explain how each of the “difficulties and disasters in observational studies” 
(Section 5.5) applies to this study, if at all. 


*5. A study to see whether birds remember color was done by putting birdseed on 
a piece of red cloth and letting the birds eat the seed. Later, empty pieces of cloth 
of varying colors (red, purple, white, and blue) were displayed. The birds 
headed for the red cloth. The researcher concluded that the birds remembered 
the color. 


*a. Using the terminology in this chapter, give an alternative explanation for the 
birds’ behavior. 


b. Suppose 20 birds were available and they could each be tested separately. 
Suggest a better method for the study than the one used. 


6. Suppose researchers were interested in determining the relationship, if any, be- 
tween brain cancer and the use of cellular telephones. Would it be better to use 
a randomized experiment or a case-control study? Explain. 


7. Researchers have found that women who take oral contraceptives (birth control 
pills) are at higher risk of having a heart attack or stroke and that the risk is sub- 
stantially higher if a woman smokes. In investigating the relationship between 
taking oral contraceptives (the explanatory variable) and having a heart attack or 
stroke (the response variable), would smoking be called a confounding variable 
or an interacting variable? Explain. 


Each of the situations in Exercises 8 to 10 contains one of the complications listed 
as “difficulties and disasters” with designed experiments or observational studies. 
Explain the problem and suggest how it could have been either avoided or ad- 
dressed. If you think more than one complication could have occurred, mention them 
all, but go into detail about only the most problematic. 


*8. To study the effectiveness of vitamin C in preventing colds, a researcher re- 
cruited 200 volunteers. She randomly assigned 100 of them to take vitamin C 
for 10 weeks and the remaining 100 to take nothing. The 200 participants 
recorded how many colds they had during the 10 weeks. The two groups were 
compared, and the researcher announced that taking vitamin C reduces the fre- 
quency of colds. 


9. A researcher was interested in teaching couples to communicate more effec- 
tively. She had 20 volunteer couples, 10 of which were randomly assigned to re- 
ceive the training program and 10 of which were not. After they had been trained 
(or not), she presented each of the 20 couples with a hypothetical problem situ- 
ation and asked them to resolve it while she tape-recorded their conversation. 
She was blind as to which 10 couples had taken the training program until after 
she had analyzed the results. 


10. Researchers ran an advertisement in a campus newspaper asking for sedentary 
volunteers who were willing to begin an exercise program. The volunteers were 
allowed to choose which of three programs they preferred: jogging, swimming, 
or aerobic dance. After 5 weeks on the exercise programs, weight loss was 
measured. The joggers lost the most weight, and the researchers announced that 
jogging was better for losing weight than either swimming or aerobic dance. 


*11. Refer to Exercise 10. What are the explanatory and response variables? 


12. Suppose you wanted to know if men or women students spend more money on 
clothes. You consider two different plans for carrying out an observational 
study: 


Plan 1: Ask the participants how much they spent on clothes during the last 
3 months; then compare the men and women. 


Plan 2: Ask the participants to keep a diary in which they record their 
clothing expenditures for the next 3 months; then compare the men and women. 


a. Which of these plans is a retrospective study? What term is used for the other 
plan? 

b. Give one disadvantage of each plan. 


13. Suppose an observational study finds that people who use public transportation 
to get to work have better knowledge of current affairs than those who drive to 
work, but that the relationship is weaker for well-educated people. What term 
from the chapter (for example, response variable) applies to each of the 
following variables? 

a. Method of getting to work 

b. Knowledge of current affairs 

c. Level of education 

d. Whether the participant reads a daily newspaper 


14. A case-control study claimed to have found a relationship between drinking 
coffee and pancreatic cancer. The cases were people recently hospitalized with 
pancreatic cancer, and the controls were people hospitalized for other reasons. 
When asked about their coffee consumption for the past year, it was found that 
the cancer cases drank more coffee than the controls. Give a reasonable 
explanation for this difference other than a relationship between coffee drinking 
and pancreatic cancer. 


15. A headline in the Sacramento Bee (11 December 1997, p. A15) read, “Study: 
Daily drink cuts death,” and the article began with the statement, “One drink a 
day can be good for health, scientists are reporting, confirming earlier research 
in a new study that is the largest to date of the effects of alcohol consumption 
in the United States.” The article also noted that “most subjects were white, 
middle-class and married, and more likely than the rest of the U.S. population 
to be college-educated.” 

a. Explain why this study could not have been a randomized experiment. 
b. Explain whether you think the headline is justified for this study. 


c. The study was based on recording drinking habits for the 490,000 partici- 
pants in 1982 and then noting death rates for the next 9 years. Is this a ret- 
rospective or a prospective study? 


d. Comment on each of the “difficulties and disasters in observational studies” 
(Section 5.5) as applied to this study. 


*16. Is it possible to conduct a randomized experiment to compare two conditions us- 
ing volunteers recruited through the newspaper? If not, explain why not. If so, 
explain how it would be done and explain any “difficulties and disasters” that 
would be encountered. 

17. Explain why a randomized experiment allows researchers to draw a causal con- 
clusion, whereas an observational study does not. 
18. Refer to Case Study 5.2, “Exercise Yourself to Sleep.” 
a. Discuss each of the “difficulties and disasters in experiments” (Section 5.3) 
as applied to this experiment. 
b. Explain whether the authors can conclude that exercise actually caused im- 
provements in sleep. 
19. Explain why each of the following is used in experiments: 
a. Placebo treatments 
b. Blinding 
c. Control groups 
20. Is the “experimenter effect” most likely to be present in a double-blind experi- 
ment, a single-blind experiment, or an experiment with no blinding? Explain. 
21. Give an example of a randomized experiment that would have poor ecological 
validity. 
*22. Explain which of the “difficulties and disasters” is most likely to be a problem 
in each of the following experiments, and why: 


*a. To see if eating just before going to bed causes nightmares, volunteers are 
recruited to spend the night in a sleep laboratory. They are randomly assigned 
to be given a meal before bed or not. Numbers of nightmares are recorded 
and compared for the two groups. 


*b. A company wants to know if placing green plants in workers’ offices will 
help reduce stress. Employees are randomly chosen to participate, and plants 
are delivered to their offices. One week later, all employees are given a stress 
questionnaire and those who received plants are compared with those who 
did not. 


23. Explain which of the “difficulties and disasters” is most likely to be a problem 
in each of the following observational studies, and why: 


a. A study measured the number of writing courses taken by students and their 
subsequent scores on the quantitative part of the Graduate Record Exam. The 
students who had taken the largest number of writing courses scored lowest 
on the exam, so the researchers concluded that students who want to pursue 
graduate careers in quantitative areas should not take many writing courses. 


b. Successful female social workers and engineers were asked to recall whether 
they had any female professors in college who were particularly influential 
in their choice of career. More of the engineers than the social workers re- 
called a female professor who stood out in their minds. 


24. For each of the following news stories in the Appendix, explain whether the 
study was a randomized experiment or an observational study. If necessary, consult 
the original source of the study on the CD. 


a. News Story 4: “Happy people can actually live longer.” 
b. News Story 6: “Music as brain builder.” 

c. News Story 11: “Double trouble behind the wheel.” 

d. News Story 15: “Kids’ stress, snacking linked.” 


25. For each of the following observational studies in the news stories in the 
Appendix, specify the explanatory and response variables. 


*a. News Story 10: “Churchgoers live longer, study finds.” 
b. News Story 12: “Working nights may increase breast cancer risk.” 
c. News Story 16: “More on TV violence.” 


d. News Story 18: “Heavier babies become smarter adults, study shows.” 


26. Read News Story 5, “Driving while distracted is common, researchers say,” and 
consult the first page of the “Executive Summary” in Original Source 5, 
“Distractions in Everyday Driving,” on the CD. Explain the extent to which 
ecological validity may be a problem in this study and what the researchers did 
to try to minimize the problem. 


27. Read News Story 8 in the Appendix, “Education, kids strengthen marriage.” 
Discuss the extent to which each of these problems with observational studies 
may affect the conclusions based on this study: 

a. Confounding variables and the implications of causation 

b. Extending the results inappropriately 

c. Using the past as a source of data 


28. Read News Story 12 in the Appendix, “Working nights may increase breast 
cancer risk.” The story describes two separate observational studies, one by Scott 
Davis and co-authors and one by Francine Laden and co-authors. Both studies 
and an editorial describing them are included on the CD, and you may need to 
consult them. In each case, explain whether 

*a. A case-control study was used or not. If it was, explain how the “controls” 
were chosen. 

*b. A retrospective or a prospective study was used. 


29. Read each of these news stories in the Appendix, and consult the original 
source article on the CD if necessary. In each case, explain whether or not a 
repeated measures design was used. 

a. News Story 6: “Music as brain builder.” 

b. News Story 11: “Double trouble behind the wheel.” 

c. News Story 14: “Study: Emotion hones women’s memory.” 


*30. For each of the following examples of relationships based on news stories in the 
Appendix and their original sources on the CD, explain whether a cause-and-effect 
conclusion is justified: 

a. News Story 1 reported that people who participated in the meditation program 
had better immune system response to a flu vaccine. 

*b. News Story 13 reported that teens who had more than $25 a week in spend- 
ing money were more likely to use drugs than kids with less spending 
money. 

c. News Story 15 reported that kids with higher levels of stress in their lives 
were more likely to eat high-fat foods and snacks. 


*31. The relationships described in the previous question are listed again here. In 
each case, explain the extent to which you think the results of the study can be 
extended to a larger population. 


a. News Story 1 reported that people who participated in the meditation program 
had better immune system response to a flu vaccine. 


*b. News Story 13 reported that teens who had more than $25 a week in spend- 
ing money were more likely to use drugs than kids with less spending 
money. 

c. News Story 15 reported that kids with higher levels of stress in their lives 
were more likely to eat high-fat foods and snacks. 


Mini-Projects 


1. Design an experiment to test something of interest to you. Explain how your 
design addresses each of the four complications listed in Section 5.3, “Difficulties 
and Disasters in Experiments.” 


2. Design an observational study to test something of interest to you. Explain how 
your design addresses each of the three complications listed in Section 5.5, 
“Difficulties and Disasters in Observational Studies.” 


3. Go to the library or the Internet and locate a journal article that describes a 
randomized experiment. Explain what was done correctly and incorrectly in the 
experiment and whether you agree with the conclusions drawn by the authors. 


4. Go to the library or the Internet and locate a journal article that describes an 
observational study. Explain how it was done using the terminology of this 
chapter and whether you agree with the conclusions drawn by the authors. 


5. Design and carry out a single-blind study using 10 participants. Your goal is to 
establish whether people write more legibly with their dominant hand. In other 
words, do right-handed people write more legibly with their right hand and do 
left-handed people write more legibly with their left hand? Explain exactly what 
you did, including how you managed to conduct a single-blind study. Mention 
things such as whether it was an experiment or an observational study and 
whether you used matched pairs or not. 


6. Pick one of the news stories in the Appendix that describes a randomized ex- 
periment and that has one or more journal articles accompanying it on the CD. 
Explain what was done in the experiment using the terminology and concepts in 
this chapter. Discuss the extent to which you agree with the conclusions drawn 
by the authors of the study and of the news story. Include a discussion of 
whether a cause-and-effect conclusion can be drawn for any observed relation- 
ships and the extent to which the results can be extended to a larger population. 


7. Pick one of the news stories in the Appendix that describes an observational 
study and that has one or more journal articles accompanying it on the CD. Ex- 
plain what was done in the experiment using the terminology and concepts in 
this chapter. Discuss the extent to which you agree with the conclusions drawn 
by the authors of the study and of the news story. Include a discussion of 
whether a cause-and-effect conclusion can be drawn for any observed relation- 
ships and the extent to which the results can be extended to a larger population. 


CHAPTER 6 


Getting the Big Picture 


6.1 Final Questions 


By now, you should have a fairly clear picture of how data should be acquired in or- 
der to be useful. We have examined how to conduct a sample survey, a randomized 
experiment, and an observational study and how to critically evaluate what others 
have done. In this chapter, we look at a few examples in depth and determine what 
conclusions can be drawn from them. The final question you should ask when you 
read the results of research is whether you will make any changes in your lifestyle 
or beliefs as a result of the research. To reach that conclusion, you need to answer a 
series of questions—not all statistical—for yourself. 




Here are some guidelines for how to evaluate a study: 


Step 1: Determine if the research was a sample survey, a randomized experiment, 
an observational study, a combination, or based on anecdotes. 


Step 2: Consider the Seven Critical Components in Chapter 2 (pp. 18-19) to 
familiarize yourself with the details of the research. 


Step 3: Based on the answer in step 1, review the “difficulties and disasters” 
inherent in that type of research and determine if any of them apply. 


Step 4: Determine if the information is complete. If necessary, see if you 
can find the original source of the report or contact the authors for missing 
information. 


Step 5: Ask if the results make sense in the larger scope of things. If they 
are counter to previously accepted knowledge, see if you can get a possible 
explanation from the authors. 


Step 6: Ask yourself if there is an alternative explanation for the results. 


Step 7: Determine if the results are meaningful enough to encourage you to 
change your lifestyle, attitudes, or beliefs on the basis of the research. 


CASE STUDY 6.1 


Mozart, Relaxation, and Performance on Spatial Tasks 


Source: Rauscher, Shaw, and Ky (14 October 1993), p. 611. 


Summary 


The researchers performed a repeated-measures experiment on 36 college students. 
Each student participated in three listening conditions, each of which was followed 
by a set of abstract/visual reasoning tasks taken from the Stanford-Binet IQ test. The 
conditions each lasted for 10 minutes. They were 


1. Listening to Mozart’s Sonata for Two Pianos in D Major 
2. Listening to a relaxation tape designed to lower blood pressure 


3. Silence 


The tasks were taken from the three abstract/visual reasoning parts of the Stanford- 
Binet IQ test that are suitable for adults: a pattern analysis test, a multiple-choice ma- 
trices test, and a multiple-choice paper folding and cutting test. The abstract/visual 
reasoning parts constitute one of four categories of the Stanford-Binet test; the oth- 
ers are verbal reasoning, quantitative reasoning, and short-term memory. None of 
those were tested in this experiment. 

The scores on the abstract/visual reasoning tasks were translated into what the 
corresponding IQ score would have been for a full-fledged test. The results showed 
averages of 119, 111, and 110, respectively, for the three listening conditions. The 
results after listening to Mozart were significantly higher than those for the other two 
conditions—enough so that chance differences could be ruled out as an explanation. 

The researchers tested some potential confounding factors and found that none 
of these could explain the results. First, they measured pulse rates before and after 
each listening session to be sure the results weren’t simply due to arousal. They 
found no effects for pulse rate, nor any interactions between pulse rate and IQ test 
results. Next, they tested to see if the order of presentation or the use of different ex- 
perimenters could have been confounded with the results, and again found no effect. 
They noted that the three different tests correlated strongly with each other, so they 
treated them as “equal measures of abstract reasoning ability.” 


Discussion 


To evaluate the usefulness of this research, let’s analyze it according to the seven 
steps listed at the beginning of this chapter. 


Step 1: Determine if the research was a sample survey, a randomized experiment, 
an observational study, a combination, or based on anecdotes. 

This is supposed to be a randomized experiment, although the authors do not 
provide information about how (or if) they randomly assigned the order of the three 
listening conditions. It qualifies as an experiment because they manipulated the en- 
vironment of the participants. Notice that because the same people were tested un- 
der all three listening conditions, this was a repeated-measures experiment. 
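
Because the report does not say how the order of the three conditions was chosen, it may help to see what random assignment of order could look like in a repeated-measures design. The sketch below is one possible scheme, not the authors' procedure: it uses each of the six possible orders equally often among 36 participants and then matches orders to participants at random. The function name and the seed are hypothetical.

```python
import itertools
import random

CONDITIONS = ("Mozart", "relaxation tape", "silence")


def counterbalanced_orders(n_participants, seed=None):
    """Give each participant an order of the conditions, using every order equally often."""
    orders = list(itertools.permutations(CONDITIONS))   # 3! = 6 possible orders
    if n_participants % len(orders) != 0:
        raise ValueError("n_participants must be a multiple of 6 for exact balance")
    schedule = orders * (n_participants // len(orders))
    random.Random(seed).shuffle(schedule)                # random order-to-person match
    return schedule


if __name__ == "__main__":
    for person, order in enumerate(counterbalanced_orders(36, seed=11), start=1):
        print(person, " -> ".join(order))
```

With a scheme like this, each condition appears first, second, and third equally often, so effects such as practice or fatigue are spread evenly across the three listening conditions rather than being confounded with one of them.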


Step 2: Consider the Seven Critical Components in Chapter 2 (pp. 18-19) to 
familiarize yourself with the details of the research. 

Based on the information given, you should be able to understand some, but not 
all, of the Seven Critical Components. The most important information missing relates 
to Component 2, the researchers who had contact with the participants. We were not 
told whether those who tested the participants knew the purpose of the experiment. In- 
formation about funding is missing as well (Component 1), but presumably the re- 
search was conducted at a university because the participants were college students. 
Finally, we were not told how the participants were selected (part of Component 3). 
The results might be interpreted differently if they were music majors, for example. 


Step 3: Based on the answer in step 1, review the “difficulties and disasters” inher- 
ent in that type of research and determine if any of them apply. 

The four possible complications listed for an experiment include confounding 
variables; interacting variables; placebo, Hawthorne, and experimenter effects; and 
ecological validity and generalizability. In this experiment, all four could be prob- 
lematic, but the most obvious is a possible experimenter effect. We were not told 
whether the subjects knew the intent of the experimenters, but even if they were not 
explicitly told, they may have figured out that they were expected to do better after 
listening to Mozart. This could create inflated scores after the Mozart condition or 
deflated scores after the other two. There is really no way to get around this because 
the subjects could not be blind to the listening condition and they were tested under 
all three conditions. 

Another problem is generalizability. It is probably not true that results obtained 
after 10 minutes in a laboratory would extend directly to the real world. 

There are also potential confounding variables. For example, we were not told 
whether the particular IQ task assigned after each listening condition was the same 
for each subject or was randomized among the three tasks. If it was the same, and if 
one task was easier for this particular group of volunteers, that condition would be 
confounded with the listening condition. We were also not told whether the experi- 
menters had as much contact with the participants for the silent condition as for the 
two listening conditions. If they did not, then amount of contact could have inter- 
acted with the listening condition to produce the effect. 


Step 4: Determine if the information is complete. If necessary, see if you can find 
the original source of the report or contact the authors for missing information. 

The summary at the beginning of this case study contains almost all of the in- 
formation in the original report in Nature, which was probably shorter than the au- 
thors would have liked because it was contained in the “Scientific Correspondence” 
section of the magazine. Substantial information is missing, some of which would 
probably help determine whether the complications listed in step 3 were really 
problems. 


Step 5: Ask if the results make sense in the larger scope of things. If they are 
counter to previously accepted knowledge, see if you can get a possible explanation 
from the authors. 

The authors gave references to “correlational, historical and anecdotal relation- 
ships between music cognition and other ‘higher brain functions’ ” but did not oth- 
erwise attempt to justify how their results could be explained. 


Step 6: Ask yourself if there is an alternative explanation for the results. 

As mentioned in step 3, the subjects were not blind to listening conditions and 
could have performed better after the Mozart condition to satisfy the experimenters. 
Perhaps the particular IQ task assigned after the Mozart condition was easier for this 
group. Perhaps the experimenters interacted more (or less) with the participants un- 
der the listening conditions. 


Step 7: Determine if the results are meaningful enough to encourage you to change 
your lifestyle, attitudes, or beliefs on the basis of the research. 

If these results are accurate, they indicate that listening to Mozart raises a certain 
type of IQ for at least a short period of time. That could be useful to you if you are 
about to take a test involving abstract or spatial reasoning. 


Meditation and Aging 


Original Source: Glaser et al. (1992), pp. 327-341. 


News Source: The effects of meditation on aging, Noetic Sciences Review (Summer 1993), 
p. 28. 


Summary 


Meditation may have more to offer than a calm mind and lower blood pressure. Re- 
cent research reported in the Journal of Behavioral Medicine shows that a simple 
meditation practiced twice a day for a 20-minute period leads to marked changes in 
an age-associated enzyme, DHEA-S. Levels of DHEA-S in experienced meditators 
correspond to those expected of someone 5 to 10 years younger who does not meditate. 

The enzyme is produced by the adrenal glands, and its level is closely correlated 
with age in humans. It has also been associated with measures of health and stress. 
Higher levels are specifically associated with lower incidences of heart disease and 
lower rates of mortality in general for males and with less breast cancer and osteo- 
porosis in women. This study compared levels of DHEA-S in 270 men and 153 
women who had practiced Transcendental Meditation (TM) or TM-Sidhi for a mean 
length of 10.3 years or 11.1 years, respectively. The meditation technique is a sim- 
ple mental technique in which the meditator sits with eyes closed while remaining 
wakeful and alert, focusing without effort on a specific meaningless sound 
(mantram). 

The results show that DHEA-S levels are higher in all age groupings and in both 
sexes among those who practice meditation than in nonmeditating controls, an effect 
independent of dietary habits and alcohol or drug consumption. 

Are these salutary changes directly due to “sitting in meditation”? According to 
Jay Glaser, who headed the study, the effect may be because meditators learn to ap- 
proach life with less physiological reaction to stress. Whatever the case, spending 20 
minutes twice a day for a body that is measurably more youthful seems like a fair 
exchange. 


Discussion 


Step 1: Determine if the research was a sample survey, a randomized experiment, 
an observational study, a combination, or based on anecdotes. 

This was an observational study because the researchers did not assign partici- 
pants to either meditate or not; they simply measured meditators and nonmeditators. 


Step 2: Consider the Seven Critical Components in Chapter 2 (pp. 18-19) to famil- 
iarize yourself with the details of the research. 

Due to the necessary brevity of the news report, several pieces of information 
from the original report are missing. Therefore, based on the news report alone, you 
would not be able to consider all of the components. Following are some of the miss- 
ing pieces, derived from the original report. The first author on the study, J. L. 
Glaser, was a researcher at the Maharishi International University, widely known for 
teaching TM. No acknowledgments were given for funding, so presumably there 
were no outside funders. The control group consisted of 799 men and 453 women 
who “represented a healthy fraction of the patients of a large, well-known New York 
City practice specializing in cosmetic dermatology who visited the practice from 
1980 to 1983 for cosmetic procedures such as hair transplants, dermabrasion, and re- 
moval of warts and moles” (Glaser et al., 1992, p. 329). 

The TM group was recruited at the campus of the Maharishi International Uni- 
versity (MIU) in Fairfield, Iowa. All of those under 45 years old and 28 of those over 
45 were recruited from local faculty, staff, students, and meditating community 
members. The remainder of those over 45 were recruited “by public announcements 
during conferences for advanced TM practitioners held on campus from 1983 to 
1987” (Ibid., p. 329). Ninety-two percent of the women and 93% of the men were 
practitioners of the more advanced TM-Sidhi program. More of the meditators than 
the controls were vegetarians, and the meditators were less likely to drink alcohol or 
smoke. 

The measurements were made from blood samples drawn at office visits 
throughout the year for the control group, and at specified time periods for the TM 
group (men under 45 years of age: between 10:45 and 11:45 a.m. in September and 
October; women under 45 years of age: same hours in April and May; over 45 years 
of age: between 1:00 and 2:30 p.m., in December for men and in July for women). 

DHEA-S levels were measured using direct radioimmunoassay. The control and 
TM groups were assayed in different batches, but random samples from the control 
group were included in all batches to make sure there was no drift over time. Also, 
contrary to the news summary, the original report noted a difference in DHEA-S lev- 
els for all age groups of women, but DHEA-S levels for men varied only for those 
over 40. 


Step 3: Based on the answer in step 1, review the “difficulties and disasters” inher- 
ent in that type of research and determine if any of them apply. 

The most obvious potential problem in this observational study (see p. 96) is 
complication 1, “confounding variables and implications of causation.” Many dif- 
ferences between the TM and control groups could be confounding the results, and 
because there is no random assignment of treatments, a causal conclusion about the 
effects of meditation cannot be made. Also, the results cannot be extended to people 
other than those similar to the TM group measured. For example, there is no way to 
know if they would extend to practitioners of other relaxation or meditation tech- 
niques, or even to practitioners of TM who are not as heavily involved with it as 
those attending MIU. We consider other explanations in step 6. 


Step 4: Determine if the information is complete. If necessary, see if you can find 
the original source of the report or contact the authors for missing information. 

Most of the necessary information was available, at least in the original report. 
One piece of missing information was whether those who drew and analyzed the 
blood knew the purpose of the study. 


Step 5: Ask if the results make sense in the larger scope of things. If they are 
counter to previously accepted knowledge, see if you can get a possible explanation 
from the authors. 

In the news report, we are not given any information about prior medical knowl- 
edge of the effects of meditation. However, returning to the original report, we find 
that the authors do cite other evidence and give potential mechanistic explanations 
for why meditation may help increase the level of the enzyme. They also cite a study 
showing that 2000 practitioners of TM who were enrolled in major medical plans 
made less use of the plans than nonmeditators in every category except obstetrics. 
They noted that the TM group had 55.5% fewer admissions for tumors and 87% 
fewer admissions for heart disease than the comparison group. Of course, these re- 
sults do not imply a causal relationship either because they are based on an obser- 
vational study. They simply add support to the idea that meditators are healthier than 
nonmeditators. 


Step 6: Ask yourself if there is an alternative explanation for the results. 

The obvious explanation is that those who choose to practice TM, especially to 
the degree that they would be visiting or attending MIU, are somehow different, and 
probably healthier than those who do not. There is no way to test this assumption. 
Remember that more of the meditators were vegetarians and they were less likely to 
drink alcohol or smoke. The authors discuss these points, however, and cite evidence 
showing that these factors would not influence the levels of this particular enzyme 
in the observed direction. Another potential confounding variable is the location of 
the test. The control group was in New York City and the TM group was in Iowa. 
Also, the control group consisted of people visiting plastic surgeons, who may be 
more likely to show early signs of aging. Perhaps you can think of other potential 
explanations. 


Step 7: Determine if the results are meaningful enough to encourage you to change 
your lifestyle, attitudes, or beliefs on the basis of the research. 

Because there is no way to establish a causal connection, these results must be 
taken only as support of a difference between the two groups in the study. 
Nonetheless, we cannot rule out the idea that meditation may be the cause of the 
observed differences between the TM and control groups, and if slowing down the 
aging process were crucial to you, these results might encourage you to learn to 
meditate. 


Drinking, Driving, and the Supreme Court 


Source: Gastwirth (1988), pp. 524-528. 


Summary 


This case study doesn’t require you to make a personal decision about the results. 
Rather, it involves a decision that was made by the Supreme Court based on statisti- 
cal evidence and illustrates how laws can be affected by studies and statistics. 

In the early 1970s, a young man between the ages of 18 and 20 challenged an 
Oklahoma state law that prohibited the sale of 3.2% beer to males under 21 but al- 
lowed its sale to females of the same age group. The case (Craig v. Boren, 429 U.S. 
190, 1976) was ultimately heard by the U.S. Supreme Court, which ruled that the 
law was discriminatory. 

Laws are allowed to use gender-based differences as long as they “serve impor- 
tant governmental objectives” and “are substantially related to the achievement of 
these objectives” (Gastwirth, 1988, p. 524). The defense argued that traffic safety 
was an important governmental objective and that data clearly show that young 
males are more likely to have alcohol-related accidents than young females. 

The Court considered two sets of data. The first set, shown in Table 6.1, consisted 
of the number of arrests for driving under the influence and for drunkenness 
for most of the state of Oklahoma, from September 1 to December 31, 1973. The 
Court also obtained population figures for the age groups in Table 6.1. Based on 
those figures, they determined that the 1393 young males arrested for one of the two 
offenses in Table 6.1 represented 2% of the entire male population in the 18-21 age 
group. In contrast, the 126 young females arrested represented only 0.18% of the 
young female population. Thus, the arrest rate for young males was about 10 times 
what it was for young females. 


Table 6.1 Arrests by Age and Sex in Oklahoma, September-December 1973 

                                     Males                        Females 
                            18-21   Over 21    Total      18-21   Over 21    Total 
Driving under influence       427     4,973    5,400         24       475      499 
Drunkenness                   966    13,747   14,713        102     1,176    1,278 
Total                       1,393    18,720   20,113        126     1,651    1,777 
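
The rate comparison in this paragraph is simple arithmetic, and a short Python sketch can make it explicit. It uses only the under-21 arrest counts from Table 6.1 and the two percentages quoted in the text. The population sizes it prints are back-calculated from those rounded percentages, so they are rough implied values and an assumption of this sketch, not figures reported in the case.

```python
# Arrest counts for the 18-21 age group, taken from Table 6.1.
young_male_arrests = 1393
young_female_arrests = 126

# Arrest rates as quoted in the text (rounded values).
male_rate = 0.02       # 2% of young males
female_rate = 0.0018   # 0.18% of young females

# How many times larger the young male arrest rate is.
rate_ratio = male_rate / female_rate

# Populations implied by the rounded rates (an assumption, not reported data).
implied_males = young_male_arrests / male_rate
implied_females = young_female_arrests / female_rate

print(f"Rate ratio: {rate_ratio:.1f}")
print(f"Implied young male population: {implied_males:,.0f}")
print(f"Implied young female population: {implied_females:,.0f}")
```

The ratio comes out a little above 11, which is consistent with the text's description of "about 10 times."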

The second set of data introduced into the case, partially shown in Table 6.2, 
came from a “random roadside survey” of cars on the streets and highways around 
Oklahoma City during August 1972 and August 1973. Surveys like these, despite the 
name, do not constitute a random sample of drivers. Information is generally col- 
lected by stopping some or all of the cars at certain locations, regardless of whether 
there is a suspicion of wrongdoing. 


Discussion 


Suppose you are a justice of the Supreme Court. Based on the evidence presented 
and the rules regarding gender-based differences, do you think the law should be up- 
held? Let’s go through the seven steps introduced in this chapter with a view toward 
making the decision the Court was required to make. 


Step 1: Determine if the research was a sample survey, a randomized experiment, 
an observational study, a combination, or based on anecdotes. 


The numbers in Table 6.1 showing arrests throughout the state of Oklahoma for 
a 4-month period are observational in nature. The figures do represent most of the 
arrests for those crimes, but the people arrested are obviously only a subset of those 
who committed the crimes. The data in Table 6.2 constitute a sample survey, based 
on a convenience sample of drivers passing by certain locations. 


Table 6.2 Random Roadside Survey of Driving and Drunkenness 
in Oklahoma City, August 1972 and August 1973 

                                        Males                          Females 
                             Under 21   Over 21   Total     Under 21   Over 21   Total 
BAC* over .01                      55       357     412           13        52      65 
Total                             481      1926    2407          138       565     703 
Percent with BAC over .01       11.4%     18.5%   17.1%         9.4%      9.2%    9.2% 

*BAC = Blood alcohol content 


Step 2: Consider the Seven Critical Components in Chapter 2 (pp. 18-19) to famil- 
iarize yourself with the details of the research. 

A few details are missing, but you should be able to ascertain answers to most 
of the components. One missing detail is how the “random roadside survey” was 
conducted. 


Step 3: Based on the answer in step 1, review the “difficulties and disasters” inher- 
ent in that type of research and determine if any of them apply. 

The arrests in Table 6.1 were used by the defense to show that young males are 
much more likely to be arrested for incidents related to drinking than are young fe- 
males. But consider the confounding factors that may be present in the data. For ex- 
ample, perhaps young males are more likely to drive in ways that call attention to 
themselves, and thus they are more likely to be stopped by the police, whether they 
have been drinking or not. Thus, young females who were driving while drunk 
would not be noticed as often. For the data in Table 6.2, because the survey was 
taken at certain locations, the drivers questioned may not be representative of all dri- 
vers. For example, if a sports event had recently ended nearby, there may be more 
male drivers on the road, and they may have been more likely to have been drinking 
than normal. 


Step 4: Determine if the information is complete. If necessary, see if you can find 
the original source of the report or contact the authors for missing information. 

The information provided is relatively complete, except for the information on 
how the random roadside survey was conducted. According to Gastwirth (1994, per- 
sonal communication), this information was not supplied in the original documenta- 
tion of the court case. 


Step 5: Ask if the results make sense in the larger scope of things. If they are 
counter to previously accepted knowledge, see if you can get a possible explanation 
from the authors. 

Nothing is suspicious about the data in either table. Remember that in 1973, 
when the data were collected, the legal drinking age in the United States had not yet 
been raised to 21 years of age. 


Step 6: Ask yourself if there is an alternative explanation for the results. 

We have discussed one possible source of a confounding variable for the arrest 
statistics in Table 6.1—namely, that males may be more likely to be stopped for 
other traffic violations. Let’s consider the data in Table 6.2. Notice that almost 80% 
of the drivers stopped were male. Therefore, at least at that point in time in Okla- 
homa, males were more likely to be driving than females. That helps explain why 10 
times more young men than young women had been arrested for alcohol-related rea- 
sons. The important point for the law being challenged in this lawsuit was whether 
young men were more likely to be driving after drinking than young women. Notice 
from Table 6.2 that of those cars with young males driving, 11.4% had blood alcohol 
levels over 0.01; of those cars with young females driving, 9.4% had blood alcohol 
levels over 0.01. These rates are statistically indistinguishable. 
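
The text does not say which test lies behind the phrase "statistically indistinguishable." As one way to check it, the sketch below assumes a standard pooled two-proportion z-test applied to the under-21 counts in Table 6.2. The function name `two_proportion_z` is a hypothetical helper written for this illustration, and the choice of test is an assumption rather than the method used in the court case.

```python
from math import sqrt, erf


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Standard normal CDF computed from the error function.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return p1, p2, z, p_value


# Under-21 drivers in Table 6.2: number with BAC over .01, and number stopped.
p_male, p_female, z, p_value = two_proportion_z(55, 481, 13, 138)

print(f"Young males with BAC over .01:   {p_male:.1%}")
print(f"Young females with BAC over .01: {p_female:.1%}")
print(f"z = {z:.2f}, two-sided p-value = {p_value:.2f}")
```

Under these assumptions the z statistic is well below 2 and the p-value is around 0.5, so a difference of this size could easily arise by chance in samples of 481 and 138 drivers.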


Step 7: Determine if the results are meaningful enough to encourage you to change 
your lifestyle, attitudes, or beliefs on the basis of the research. 

In this case study, the important question is whether the Supreme Court justices 
were convinced that the gender-based difference in the law was reasonable. The 
Supreme Court overturned the law, concluding that the data in Table 6.2 “provides 
little support for a gender line among teenagers and actually runs counter to the im- 
position of drinking restrictions based upon age” (Gastwirth, 1988, p. 527). 


Smoking During Pregnancy and Child’s IQ 


Original Source: Olds, Henderson, and Tatelbaum (February 1994), pp. 221-227. 
News Source: Study: Smoking may lower kids’ IQs (11 February 1994), p. A-10. 


Summary 


The news article for this case study is shown in Figure 6.1. 


Discussion 


Step 1: Determine if the research was a sample survey, a randomized experiment, 
an observational study, a combination, or based on anecdotes. 

This was an observational study because the researchers could not randomly as- 
sign mothers to either smoke or not during pregnancy; they could only observe their 
smoking behavior. 


Step 2: Consider the Seven Critical Components in Chapter 2 (pp. 18-19) to 
familiarize yourself with the details of the research. 

As in Case Study 6.1, the brevity of the news report necessarily meant that some 
details were omitted. Based on the original report, the seven questions can all be an- 
swered. Following is some additional information. 

The research was supported by a number of grants from sources such as the Bu- 
reau of Community Health Services, the National Center for Nursing Research, and 
the National Institutes of Health. None of the funders seems to represent special in- 
terest groups related to tobacco products. 

The researchers described the participants as follows: 


We conducted the study in a semirural county in New York State with a popula- 
tion of about 100,000. Between April 1978 and September 1980, we inter- 
viewed 500 primiparous women [those having their first live birth] who 
registered for prenatal care either through a free antepartum clinic sponsored 
by the county health department or through the offices of 11 private obstetri- 
cians. (All obstetricians in the county participated in the study.) Four hundred 
women signed informed consent to participate before their 30th week of preg- 
nancy. (Olds et al., 1994, p. 221) 


Figure 6.1 

Source: “Study: Smoking May Lower Kids’ IQs.” Associated Press, February 11, 1994. 
Reprinted with permission. 

Study: Smoking May Lower Kids’ IQs 

ROCHESTER, N.Y. (AP) - Secondhand smoke has little impact on the intelligence 
scores of young children, researchers found. But women who light up while 
pregnant could be dooming their babies to lower IQs, according to a study released 
Thursday. Children ages 3 and 4 whose mothers smoked 10 or more cigarettes a 
day during pregnancy scored about 9 points lower on the intelligence tests than 
the offspring of nonsmokers, researchers at Cornell University and the University 
of Rochester reported in this month’s Pediatrics journal. That gap narrowed to 4 
points against children of nonsmokers when a wide range of interrelated factors 
were controlled. The study took into account secondhand smoke as well as diet, 
education, age, drug use, parents’ IQ, quality of parental care and duration of 
breast feeding. “It is comparable to the effects that moderate levels of lead 
exposure have on children’s IQ scores,” said Charles Henderson, senior research 
associate at Cornell’s College of Human Ecology in Ithaca. 


The researchers also noted that “eighty-five percent of the mothers were either 
teenagers (<19 years at registration), unmarried, or poor. Analysis [was] limited to 
whites who comprised 89% of the sample” (p. 221). 

The explanatory variable, smoking behavior, was measured by averaging the re- 
ported number of cigarettes smoked at registration and at the 34th week of pregnancy. 
For the information included in the news report, the only two groups used were moth- 
ers who smoked an average of 10 or more cigarettes per day and those who smoked 
no cigarettes. Those who smoked between 1 and 9 per day were excluded. 

The response variable, IQ, was measured at 12 months with the Bayley Mental 
Development Index, at 24 months with the Cattell Scales, and at 36 and 48 months 
with the Stanford-Binet IQ test. In addition to those mentioned in the news source 
(secondhand smoke, diet, education, age, drug use, parents’ IQ, quality of parental 
care, and duration of breast feeding), other potential confounding variables measured 
were husband/boyfriend support, marital status, alcohol use, maternal depressive 
symptoms, father’s education, gestational age at initiation of prenatal care, and num- 
ber of prenatal visits. None of those were found to relate to intellectual functioning. 
It is not clear if the study was single-blind. In other words, did the researchers who 
measured the children’s IQs know about the mother’s smoking status or not? 


Step 3: Based on the answer in step 1, review the “difficulties and disasters” inher- 
ent in that type of research and determine if any of them apply. 

The study was prospective, so memory is not a problem. However, there are 
problems with potential confounding variables, and there may be a problem with try- 
ing to extend these results to other groups, such as older mothers. The fact that the 
difference in IQ for the two groups was reduced from 9 points to 4 points with the 
inclusion of several additional variables may indicate that the difference could be 
even further reduced by the addition of other variables. 

The authors noted both of these as potential problems. They commented that “the 
particular sample used in this study limits the generalizability of the findings. The 
sample was at considerable risk from the standpoint of its sociodemographic char- 
acteristics, so it is possible that the adverse effects of cigarette smoking may not be 
as strong for less disadvantaged groups” (Olds et al., 1994, p. 225). 

The authors also mentioned two potential confounding variables. First, they 
noted, “We are concerned about the reliability of maternal report of illegal drug and 
alcohol use” (Olds et al., 1994, p. 225), and, “in addition, we did not assess fully the 
child’s exposure to side-stream smoke during the first four years after delivery” 
(Olds et al., 1994, p. 225). 


Step 4: Determine if the information is complete. If necessary, see if you can find 
the original source of the report or contact the authors for missing information. 

The information in the original report is fairly complete, but the news source left 
out some details that would have been useful, such as the fact that the subjects were 
young and of lower socioeconomic status than the general population of mothers. 


Step 5: Ask if the results make sense in the larger scope of things. If they are 
counter to previously accepted knowledge, see if you can get a possible explanation 
from the authors. 

The authors speculate on what the causal relationship might be, if indeed there 
is one. For example, they speculate that “tobacco smoke could influence the devel- 
oping fetal nervous system by reducing oxygen and nutrient flow to the fetus” 
(p. 226). They also speculate that “cigarette smoking may affect maternal/fetal nu- 
trition by increasing iron requirements and decreasing the availability of other nutri- 
ents such as vitamins, B12 and C, folate, zinc, and amino acids” (p. 226). 


Step 6: Ask yourself if there is an alternative explanation for the results. 

As with most observational studies, there could be confounding factors that were 
not measured and controlled. Also, if the researchers who measured the children’s 
IQs were aware of the mother’s smoking status, that could have led to some experi- 
menter bias. You may be able to think of other potential explanations. 


Step 7: Determine if the results are meaningful enough to encourage you to change 
your lifestyle, attitudes, or beliefs on the basis of the research. 

If you were pregnant and were concerned about allowing your child to have the 
highest possible IQ, these results may lead you to decide to quit smoking during the 
pregnancy. A causal connection cannot be ruled out. 


For Class Discussion: Guns and Homicides at Home 


Original Source: Kellerman et al. (7 October 1993), pp. 1084-1091. 


Summary 


The news source read as follows: 


Challenging the common assumption that guns protect their owners, a multi- 
state study of hundreds of homicides has found that keeping a gun at home 
nearly triples the likelihood that someone in the household will be slain there. 

The study, published in New England Journal of Medicine, studied the 
records of three populous counties surrounding Seattle, Washington, Cleveland, 
Ohio, and Memphis, Tennessee. The counties offered a sample representative of 
the entire nation because of the mix of urban, suburban, and rural communities. 

Although 1860 homicides occurred during the study period, the team looked 
only at those that occurred in the homes of the victims—about 400 deaths. The 
researchers found that members of households with guns were 2.7 times more 
likely to experience a homicide than those in households without guns. 

In nearly 77 percent of the cases, victims were killed by a relative or some- 
one they knew. In only about 4 percent of the cases were victims killed by a 
stranger. In most of the remaining cases, the identity of the persons who 
committed the homicides could not be determined. (Washington Post, 17-23 
October 1993) 


Mini-Projects 


1. Find a news article about a statistical study. Evaluate it using the seven steps on 
page 108. If all of the required information is not available in the news article, 
locate the journal article or other source of the research. As part of your analy- 
sis, make sure you discuss step 7 with regard to your own life. 


2. Choose one of the news stories in the Appendix and the accompanying material 
on the CD. Evaluate it using the seven steps on page 108. If all of the required 
information is not available in the news article, locate the journal article or other 
source of the research. As part of your analysis, make sure you discuss step 7 
with regard to your own life. 


3. Find the journal article in the New England Journal of Medicine on which Case 
Study 6.5 is based. Evaluate the study using the seven steps on page 108. 
Finding Life in Data 


In Part 1, you learned how data should be collected to be meaningful. In Part 
2, you will learn some simple things you can do with data after it has been col- 
lected. The goal of the material in this part is to increase your awareness of the 
usefulness of data and to help you interpret and critically evaluate what you 
read in the press. 

First, you will learn how to take a collection of numbers and summarize 
them in useful ways. For example, you will learn how to find out more about 
your own pulse rate by taking repeated measurements and drawing a useful 
picture. 

Second, you will learn to critically evaluate presentations of data made by 
others. From numerous examples of situations where the uneducated consumer 
could be misled, you will learn how to critically read and evaluate graphs, 
pictures, and data summaries. 
Summarizing and Displaying Measurement Data 


Thought Questions 


1. If you were to read the results of a study showing that daily use of a certain exercise 
machine resulted in an average 10-pound weight loss, what more would you want 
to know about the numbers in addition to the average? (Hint: Do you think everyone 
who used the machine lost 10 pounds?) 

2. Suppose you are comparing two job offers, and one of your considerations is the cost 
of living in each area. You get the local newspapers and record the price of 50 advertised 
apartments for each community. What summary measures of the rent values 
for each community would you need in order to make a useful comparison? For instance, 
would the lowest rent in the list be enough information? 

3. A real estate Web site for the Greenville, South Carolina area reported that the median 
price of single family homes sold in the past 9 months in the local area was 
$136,900 and the average price was $161,447. How do you think these values are 
computed? Which do you think is more useful to someone considering the purchase 
of a home, the median or the average? (Source: http://www.carolinahomesrealty.com/areahomes.htm, 
26 October 2003.) 

4. The Stanford-Binet IQ test is designed to have a mean, or average, of 100 for the entire 
population. It is also said to have a standard deviation of 16. What aspect of the 
population of IQ scores do you think is described by the “standard deviation”? For 
instance, do you think it describes something about the average? If not, what might 
it describe? 

5. Students in a statistics class at a large state university were given a survey in which 
one question asked was age (in years). One student was a retired person, and her age 
was an “outlier.” What do you think is meant by an “outlier”? If the students’ 
heights were measured, would this same retired person necessarily have a value that 
was an “outlier”? Explain. 
7.1 Turning Data into Information 


Looking at a long list of numbers is about as informative as looking at a scrambled 
set of letters. To get information out of data, the data have to be organized and sum- 
marized. As an example, suppose you were told that you received a score of 80 on 
an examination and that the scores in the class were as follows: 


75, 95, 60, 93, 85, 84, 76, 92, 62, 83, 80, 90, 64, 75, 79, 32, 78, 64, 98, 73, 88, 
61, 82, 68, 79, 78, 80, 55 


How useful would that list of numbers be to you, at first glance? Do you have any 
idea where you are relative to the rest of the class? The first thought that may occur 
to you is to put the numbers into increasing order so you could see where your score 
was relative to the others. Doing that, you find: 


32, 55, 60, 61, 62, 64, 64, 68, 73, 75, 75, 76, 78, 78, 79, 79, 80, 80, 82, 83, 84, 
85, 88, 90, 92, 93, 95, 98 


Now you can see that you are somewhat above the middle, but this list still isn’t easy 
to assimilate into a useful picture. It would help if we could summarize the numbers. 

There are four kinds of useful information about a set of data, and each can be 
measured and expressed in a variety of ways. These are the center (mean, median, 
mode), unusual values called outliers, the variability, and the shape. 


The Mean, Median, and Mode 


The first useful concept is the idea of the “center” of the data. What’s a typical or av- 
erage value? For the test scores just given, the numerical average, or mean, is 76.04. 
As another measure of “center” consider that there were 28 values in that test score 
set, so the median, with half of the scores above and half of the scores below it, is 
78.5, halfway between 78 and 79. 
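
The mean and median quoted above can be reproduced directly from the list of exam scores. The following is a minimal sketch using Python's standard `statistics` module; the variable name `scores` is simply a label chosen for this illustration.

```python
from statistics import mean, median

# The 28 exam scores listed at the start of this section.
scores = [75, 95, 60, 93, 85, 84, 76, 92, 62, 83, 80, 90, 64, 75, 79, 32,
          78, 64, 98, 73, 88, 61, 82, 68, 79, 78, 80, 55]

print(len(scores))             # number of scores: 28
print(round(mean(scores), 2))  # numerical average: 76.04
print(median(scores))          # middle of the ordered list: 78.5
```

Because there is an even number of scores, the median is the average of the two middle values in the ordered list, 78 and 79.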

Another measure of “center,” called the mode, is occasionally useful. The mode 
is simply the most common value in the list. For the exam scores, no single mode 
exists because each of the scores 64, 75, 78, 79 and 80 occurs twice. The mode is 
most useful for discrete or categorical data with a relatively small number of possi- 
ble values. For example, if you measured the class standing of all the students in 
your statistics class and coded them with 1 = freshman, 2 = sophomore, and so on, 
it would probably be more useful to know the mode (most common class standing) 
than to know the mean or the median. 
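
To see why the exam scores have no single mode while coded categorical data usually do, the sketch below uses `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for most common. The `standings` list is hypothetical data invented for this illustration, not taken from any real class.

```python
from statistics import multimode

# The 28 exam scores listed at the start of this section.
scores = [75, 95, 60, 93, 85, 84, 76, 92, 62, 83, 80, 90, 64, 75, 79, 32,
          78, 64, 98, 73, 88, 61, 82, 68, 79, 78, 80, 55]

# Five scores tie for most common, so there is no single mode.
print(sorted(multimode(scores)))  # [64, 75, 78, 79, 80]

# Hypothetical class standings: 1 = freshman, 2 = sophomore, and so on.
standings = [1, 1, 2, 2, 2, 3, 2, 4, 1, 2]
print(multimode(standings))       # [2], so sophomore is the most common standing
```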


Outliers 


You can see that for the test scores, the median of 78.5 is somewhat higher than the 
mean of 76.04. That’s because a very low score, 32, pulled down the mean. It didn’t 
pull down the median because, as long as that very low score was 78 or less, its ef- 
fect on the median would be the same. 
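
The claim that the lowest score moves the mean but not the median can be checked by replacing the 32 with other values and recomputing. This is a small sketch; `with_lowest_replaced` is a hypothetical helper written only for this illustration.

```python
from statistics import mean, median

# The 28 exam scores listed at the start of this section.
scores = [75, 95, 60, 93, 85, 84, 76, 92, 62, 83, 80, 90, 64, 75, 79, 32,
          78, 64, 98, 73, 88, 61, 82, 68, 79, 78, 80, 55]


def with_lowest_replaced(values, new_low):
    """Return a copy of values with its minimum replaced by new_low."""
    copy = list(values)
    copy[copy.index(min(copy))] = new_low
    return copy


# Try the original low score and two higher replacements.
for low in (32, 55, 78):
    changed = with_lowest_replaced(scores, low)
    print(low, round(mean(changed), 2), median(changed))
```

The mean rises each time the lowest score is raised, while the median stays at 78.5 for all three replacement values.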

If one or two scores are far removed from the rest of the data, they are called out- 
liers. There are no hard and fast rules for determining what qualifies as an outlier, 
but we will learn some guidelines that are often used in identifying them. In this 
case, most people would agree that the score of 32 is so far removed from the other 
values that it definitely qualifies as an outlier. 


Variability 


Another kind of useful information contained in a set of data is the variability. 
How spread out are the values? Are they all close together? Are most of them to- 
gether, but a few are outliers? Knowing that the mean is about 76, your test score of 
80 is still hard to interpret. It would obviously have a different connotation for you 
if the scores ranged from 72 to 80 rather than from 32 to 98. 

The idea of natural variability, introduced in Chapter 3, is particularly important 
when summarizing a set of measurements. Much of our work in statistics involves 
comparing an observed difference to what we should expect if the difference is due 
solely to natural variability. For instance, to determine if global warming is occur- 
ring, we need to know how much the temperatures in a given area naturally vary 
from year to year. To determine if a one-year-old child is growing abnormally 
slowly, we need to know how much heights of one-year-old children naturally vary. 


Minimum, Maximum, and Range 

The simplest measure of variability is to find the minimum value and the maximum 
value and to compute the range, which is just the difference between them. In the 
case of the test scores, the scores went from 32 to 98, for a range of 66 points. Tem- 
peratures on a given date in a certain location may range from a record low of 59 de- 
grees Fahrenheit to a record high of 90 degrees, a 31-degree range. We introduce two 
more measures of variability, the interquartile range and the standard deviation, later 
in this chapter. 
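The range takes only a line or two to compute. This short Python sketch, offered as an illustration with names of our own choosing, finds the minimum, maximum, and range of the test scores.

```python
scores = [75, 95, 60, 93, 85, 84, 76, 92, 62, 83, 80, 90, 64, 75,
          79, 32, 78, 64, 98, 73, 88, 61, 82, 68, 79, 78, 80, 55]

lowest, highest = min(scores), max(scores)
score_range = highest - lowest   # range = maximum minus minimum

print(lowest, highest, score_range)
```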


Shape 


The third kind of useful information is the shape, which can be derived from a cer- 
tain kind of picture of the data. We can answer questions such as: Are most of the 
values clumped in the middle with values tailing off at each end? Are there two dis- 
tinct groupings? Are most of the values clumped together at one end with a few very 
high or low values? You can see that your score of 80 would have different meanings 
depending on how the other students’ scores grouped together. For example, if half 
of the remaining students had scores of 50 and the other half scores of 100, then even 
though your score of 80 was “above average,” it wouldn’t look so good. Next we fo- 
cus on how to look at the shape of the data. 


7.2 Picturing Data: Stemplots and Histograms 


About Stemplots 


A stemplot is a quick and easy way to put a list of numbers into order while getting 
a picture of their shape. The easiest way to describe a stemplot is to construct one. 
Figure 7.1
Building a stemplot of test scores

Step 1              Step 2              Step 3
Creating the stem   Attaching leaves    The finished stemplot
3|                  3|                  3|2
4|                  4|                  4|
5|                  5|                  5|5
6|                  6|0                 6|024418
7|                  7|5                 7|56598398
8|                  8|                  8|5430820
9|                  9|53                9|53208

Example: 3|2 = 32


Let’s first use the test scores we’ve been discussing, then we will turn to some real 
data, where each number has an identity. Before reading any further, look at the 
right-most part of Figure 7.1 so you can see what a completed stemplot looks like. 
Each of the digits extending to the right represents one data point. The first thing you 
see is 3|2. That represents the lowest test score of 32. Each of the digits on the right 
represents one test score. For instance, see if you can locate the highest score, 98. 
It’s the last value to the right of the “stem” value of 9|.


Creating a Stemplot 


Stemplots are sometimes called stem-and-leaf plots or stem-and-leaf diagrams. 
Only two steps are needed to create a stemplot—creating the stem and attaching the 
leaves. 


Step 1: Create the stems. 


The first step is to divide the range of the data into equal units to be used on the stem. 
The goal is to have from 6 to 15 stem values, representing equally spaced intervals. 
In the example shown in Figure 7.1, each of the seven stem values represents a range 
of 10 points in test scores. For instance, any score in the 80s, from 80 to 89.99, would 
be placed after the 8| on the stem. 


Step 2: Attach the leaves. 


The second step is to attach a leaf to represent each data point. The next digit in the 
number is used as the leaf, and if there are any remaining digits they are simply 
dropped. Let’s use the unordered list of test scores first displayed: 


75, 95, 60, 93, 85, 84, 76, 92, 62, 83, 80, 90, 64, 75, 79, 32, 78, 64, 98, 73, 88, 
61, 82, 68, 79, 78, 80, 55 


The middle part of Figure 7.1 shows the picture after leaves have been attached for 
the first four test scores, 75, 95, 60, and 93. The finished picture, on the right, has 
the leaves attached for all 28 scores. Sometimes an additional step is taken and the 
leaves are ordered numerically on each branch. 
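The two steps can also be sketched in a few lines of Python. This is a minimal illustration for two-digit data such as the test scores, assuming the tens digit is the stem and the units digit is the leaf; the function name `stemplot` and its `order_leaves` option are our own inventions.

```python
from collections import defaultdict

def stemplot(values, order_leaves=False):
    """Return stemplot rows, using the tens digit as stem and the units digit as leaf."""
    leaves = defaultdict(list)
    for v in values:
        stem, leaf = divmod(int(v), 10)   # int() drops any digits after the decimal point
        leaves[stem].append(leaf)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):   # keep empty stems, such as 4
        row = sorted(leaves[stem]) if order_leaves else leaves[stem]
        rows.append(f"{stem}|" + "".join(str(d) for d in row))
    return rows

scores = [75, 95, 60, 93, 85, 84, 76, 92, 62, 83, 80, 90, 64, 75,
          79, 32, 78, 64, 98, 73, 88, 61, 82, 68, 79, 78, 80, 55]

print("\n".join(stemplot(scores)))
```

Calling `stemplot(scores, order_leaves=True)` performs the optional extra step of ordering the leaves on each branch.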


Figure 7.2
Two stemplots for the same pulse rate data (Stemplot A and Stemplot B)


Further Details for Creating Stemplots


Suppose you wanted to create a picture of what your own pulse rate is when you are 
relaxed. You collect 25 values over a series of a few days and find that they range 
from 54 to 78. If you tried to create a stemplot using the first digit as the stem, you 
would have only three stem values (5, 6 and 7). If you tried to use both digits for 
the stem, you could have as many as 25 separate values, and the picture would be 
meaningless. 

The solution to this problem is to reuse each of the digits 5, 6, and 7 in the stem. 
Because you need to have equally spaced intervals, you could use each of the digits 
two or five times. If you use them each twice, the first listed would receive leaves 
from 0 to 4, and the second would receive leaves from 5 to 9. Thus, each stem value 
would encompass a range of five beats per minute of pulse. If you use each digit five 
times, each stem value would receive leaves of two possible values. The first stem 
for each digit would receive 0 and 1, the second would receive 2 and 3, and so on. 
Notice that if you tried to use the initial pulse digits three or four times each, you 
could not evenly divide the leaves among them because there are always 10 possible 
values for leaves. Figure 7.2 shows two possible stemplots for the same hypotheti- 
cal pulse data. Stemplot A shows the digits 5, 6, and 7 used twice; stemplot B shows 
them used five times. (The first two 5’s are not needed and not shown.) 
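The idea of reusing stem digits can be sketched in Python as follows. The 25 pulse values are hypothetical, made up only to range from 54 to 78 as in the discussion; they are not the values plotted in Figure 7.2. The function name `split_stemplot` is likewise our own.

```python
def split_stemplot(values, splits):
    """Stemplot rows with each tens digit reused `splits` times (2 or 5)."""
    if splits not in (2, 5):
        raise ValueError("the 10 leaf digits divide evenly only into 2 or 5 groups")
    width = 10 // splits                 # number of leaf digits per stem row
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows.setdefault((stem, leaf // width), []).append(leaf)
    last = max(rows)
    stem, part = min(rows)               # start at the first row that is needed
    lines = []
    while (stem, part) <= last:
        lines.append(f"{stem}|" + "".join(map(str, rows.get((stem, part), []))))
        part += 1
        if part == splits:               # move on to the next tens digit
            stem, part = stem + 1, 0
    return lines

# Hypothetical resting pulse rates (25 values from 54 to 78).
pulse = [54, 57, 58, 59, 60, 61, 62, 62, 63, 64, 64, 65, 65,
         65, 66, 66, 67, 67, 68, 69, 70, 71, 72, 75, 78]

print("\n".join(split_stemplot(pulse, 2)))
print()
print("\n".join(split_stemplot(pulse, 5)))
```

With these made-up values, using each digit five times starts at the third 5, because no values fall in the 50 to 53 range.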


EXAMPLE 1  Stemplot of Median Income for Families of Four


Table 7.1 lists the estimated median income for a four-person family in 2001 for each of 
the 50 states and the District of Columbia, information released by the U.S. government 
in April 2003 for use in setting government aid levels. Scanning the list gives us some 
information, but it would be easier to get the big picture if it were in some sort of nu- 
merical order. We could simply list the states by value instead of alphabetically, but that 
would not give us a picture of the shape. 

The first step is to decide what values to use for the stem. The median family in- 
comes range from a low of $46,596 (for New Mexico) to a high of $82,879 (for Mary- 
land), for a range of $36,283. The goal is to use the first digit or few digits in each 
number as the stem, in such a way that the stem is divided into 6 to 15 equally spaced 
intervals. 
Table 7.1 2001 Median Income for a Family of Four

State                 Income     State            Income
Alabama               $54,594    Montana          $48,078
Alaska                $71,395    Nebraska         $60,626
Arizona               $56,067    Nevada           $59,283
Arkansas              $47,838    New Hampshire    $72,606
California            $63,761    New Jersey       $80,577
Colorado              $67,634    New Mexico       $46,596
Connecticut           $82,517    New York         $66,498
Delaware              $73,301    North Carolina   $56,500
District of Columbia  $61,799    North Dakota     $55,138
Florida               $56,824    Ohio             $64,282
Georgia               $59,497    Oklahoma         $53,949
Hawaii                $66,014    Oregon           $58,737
Idaho                 $51,098    Pennsylvania     $66,130
Illinois              $66,507    Rhode Island     $70,446
Indiana               $63,573    South Carolina   $59,212
Iowa                  $61,656    South Dakota     $59,718
Kansas                $61,686    Tennessee        $56,052
Kentucky              $54,319    Texas            $56,606
Louisiana             $51,234    Utah             $59,035
Maine                 $58,425    Vermont          $62,938
Maryland              $82,879    Virginia         $69,616
Massachusetts         $80,247    Washington       $65,997
Michigan              $68,337    West Virginia    $49,470
Minnesota             $72,635    Wisconsin        $65,441
Mississippi           $46,810    Wyoming          $58,541
Missouri              $61,036


Source: Federal Registry, April 15, 2003, http://a257.g.akamaitech.net/7/257/2422/14mar20010800/ 
edocket.access.gpo.gov/2003/03-9088.htm 


If we use the first digit in each income value once, ranging from 4, representing
incomes in the $40,000s, to 8, representing incomes in the $80,000s, we would have only
five values on the stem. Because we need each part of the stem to represent the same
range, we have two other choices. We can divide each group of $10,000 into two
intervals of $5000 each, or we can divide them into five intervals of $2000 each. If we
divide the incomes into intervals of $5000, we will need to begin the stem with the second
half of the $40,000 range and end it with the first half of the $80,000 range, resulting
in stem values of 4, 5, 5, 6, 6, 7, 7, 8 for a total of eight stem values. If we use intervals
of $2000, we will need to begin the stem with a value representing incomes in the
$46,000s and end with a value representing incomes in the $82,000s. That would
require two stem values of 4, five stem values for each of 5, 6, and 7, and two stem
values of 8 for a total of 19 stem values. That exceeds the number of intervals normally
used, so we will divide incomes into intervals of $5000.

Figure 7.3 shows the completed stemplot. Notice that the leaves have been put in 
order. Notice also that the income values have been truncated instead of rounded. To 
truncate a number, simply drop off the unused digits. Thus, the lowest income of 
$46,596 for New Mexico is truncated to $46,000 instead of rounded to $47,000. 
Rounding could be used instead.


Figure 7.3
Stemplot of median incomes for families of four

4|66789
5|11344
5|56666688899999
6|011112334
6|556666789
7|01223
7|
8|0022

Example: 4|6 = $46,xxx


Obtaining Information from the Stemplot


Stemplots help us determine the “shape” of a data set, identify outliers, and locate 
the center. For instance, the pulse rates in Figure 7.2 have a “bell shape” in which 
they are centered in the mid-60s and tail off in both directions from there. There are 
no outliers. The stemplot of test scores in Figure 7.1 clearly illustrates the outlier of 
32. Aside from that and the score of 55, they are almost uniformly distributed in the 
60s, 70s, 80s, and 90s. 

From the stemplot of median income data in Figure 7.3, we can make several ob- 
servations. First, there is a wide range of values, with the median income in Mary- 
land, the highest, being almost twice that of New Mexico, the lowest. Second, there 
appear to be four states with unusually high median family incomes, all in the 
$80,000 range. From Table 7.1 we can see that these are Connecticut, Maryland, 
Massachusetts, and New Jersey. Then there is a gap before reaching Delaware, in the 
$73,000 range. The remaining states tend to be almost “bell-shaped” with a center 
around the high $50,000s or low $60,000s. There are no obvious outliers. 

If we were interested in what factors determine income levels, we could use this 
information from the stemplot to help us. We would pursue questions like “What is 
different about the four high-income states?” We might notice that much of their 
population works in high-income cities. Many New York City employees live in 
Connecticut and New Jersey, and Washington, D.C. employees live in Maryland. 
Much of the population of Massachusetts lives and works in the Boston area. 


Creating a Histogram 


Histograms are pictures related to stemplots. For very large data sets, a histogram 
is more feasible than a stemplot because it doesn’t list every data value. 

To create a histogram, divide the range of the data into intervals in much the 
same way as we did when creating a stemplot. But instead of listing each individual 
value, simply count how many fall into each part of the range. Draw a bar whose 
height is equal to the count for each part of the range. Or, equivalently, make the 
height equal to the proportion of the total count that falls in that interval. 

Figure 7.4 shows a histogram for the income data from Table 7.1. Notice that the 
heights of the bars are represented as frequencies. For example, there are four val- 
ues in the highest median income range, centered on $81,000. If this histogram had 
used the proportion in each category on the vertical axis instead, then the height of 
the bar centered on $81,000 would be 4/51 or about 0.08. In that case, the heights 
of all of the bars must sum to 1, or 100%. If you wanted to know what proportion 
of the data fell into a certain interval or range, you would simply sum the heights of 
the bars for that range. Also, notice that if you were to turn the histogram on its side, 
it would look very much like a stemplot except that the labels would differ slightly. 
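The counting that underlies a histogram can be sketched in Python. The incomes below are the truncated values (in thousands of dollars) read from the stemplot in Figure 7.3, and the $5000 interval width is an assumption chosen to match that stemplot; the intervals actually used to draw Figure 7.4 may differ.

```python
from collections import Counter

# Median incomes in thousands of dollars, truncated, as read from Figure 7.3.
incomes_k = [46, 46, 47, 48, 49, 51, 51, 53, 54, 54,
             55, 56, 56, 56, 56, 56, 58, 58, 58, 59, 59, 59, 59, 59,
             60, 61, 61, 61, 61, 62, 63, 63, 64,
             65, 65, 66, 66, 66, 66, 67, 68, 69,
             70, 71, 72, 72, 73, 80, 80, 82, 82]

BIN_WIDTH = 5   # thousands of dollars; an assumed interval width

# Label each value by the lower end of its interval, then count.
counts = Counter((x // BIN_WIDTH) * BIN_WIDTH for x in incomes_k)
bins = range(45, 85, BIN_WIDTH)
frequencies = {b: counts.get(b, 0) for b in bins}
proportions = {b: frequencies[b] / len(incomes_k) for b in bins}

for b in bins:
    print(f"{b}-{b + BIN_WIDTH - 1}: {'#' * frequencies[b]} ({proportions[b]:.2f})")
```

The printed rows of `#` symbols form a rough histogram turned on its side, and the proportions in parentheses sum to 1.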


EXAMPLE 2  Heights of British Males


Figure 7.5 displays a histogram of the heights, in millimeters, of 199 randomly selected
British men (Marsh, 1988, p. 315; data reproduced in Hand et al., 1994, pp. 179-183).
The histogram is rotated sideways from the one in Figure 7.4. Some computer programs
display histograms with this orientation. Notice that the heights create a “bell shape”
with a center in the mid-1700s (millimeters). There are no outliers.
Figure 7.4
Histogram of median family income data


Figure 7.5
Heights of British males in millimeters (N = 199)
Horizontal axis: number of men
Source: Data disk from Hand et al., 1994.

EXAMPLE 3  The Old Faithful Geyser


Figure 7.6 shows a histogram of the times between eruptions of the “Old Faithful” 
geyser. Notice that the picture appears to have two clusters of values, with one centered 
around 50 minutes and another, larger cluster centered around 80 minutes. A picture like 
this may help scientists figure out what causes the geyser to erupt when it does.


EXAMPLE 4  How Much Do Students Exercise?


Students in an introductory statistics class were asked a number of questions on the first 
day of class. Figure 7.7 shows a histogram of 172 responses to the question “How many 
hours do you exercise per week (to the nearest half hour)?” Notice that the bulk of the 
responses are in the range from 0 to 10 hours, with a mode of 2 hours. But there are
responses trailing out to a maximum of 30 hours a week, with 5 responses at or above
20 hours a week.


Figure 7.6
Times between eruptions of “Old Faithful” geyser (N = 299)
Source: Hand et al., 1994.


Figure 7.7
Self-reported hours of exercise for 172 college students
Horizontal axis: hours of exercise
Source: The author's students.


Defining a Common Language about Shape 


Symmetric Data Sets 

Scientists often talk about the “shape” of data; what they really mean is the shape of 
the stemplot or histogram resulting from the data. A symmetric data set is one in 
which, if you were to draw a line through the center, the picture on one side would 
be a mirror image of the picture on the other side. A special case, which will be
discussed in detail in Chapter 8, is a bell-shaped data set, in which the picture is not
only symmetric but also shaped like a bell. The stemplots in Figure 7.2, displaying
pulse rates, and the histogram in Figure 7.5, displaying male heights, are approximately
symmetric and bell-shaped.


Unimodal or Bimodal 

Recall that the mode is the most common value in a set of data. If there is a single 
prominent peak in a histogram or stemplot, as in Figures 7.2 and 7.5, the shape is 
called unimodal, meaning “one mode.” If there are two prominent peaks, the shape 
is called bimodal, meaning “two modes.” Figure 7.6, displaying the times between 
eruptions of the Old Faithful geyser, is bimodal. There is one peak around 50 min- 
utes, and a higher peak around 80 minutes. 


Skewed Data Sets 

In common language, something that is skewed is off-center in some way. In statis- 
tics, a skewed data set is one that is basically unimodal but is substantially off from 
being bell-shaped. If it is skewed to the right, the higher values are more spread out 
than the lower values. Figure 7.7, displaying hours of exercise per week for college 
students, is an example of data skewed to the right. If a data set is skewed to the left, 
then the lower values are more spread out and the higher ones tend to be clumped. 
This terminology results from the fact that before computers were used, shape pic- 
tures were always hand drawn using the horizontal orientation in Figure 7.4. Notice 
that a picture that is skewed to the right, like Figure 7.7, extends further to the right 
of the highest peak (the tallest bar) than to the left. Most students think the termi- 
nology should be the other way around, so be careful to learn this definition! The di- 
rection of the “skew” is the direction with the unusual values, and not the direction 
with the bulk of the data. 


7.3 Five Useful Numbers: A Summary 


A five-number summary is a useful way to summarize a long list of numbers. As 
the name implies, this is a set of five numbers that provide a good summary of the 
entire list. Figure 7.8 shows what the five useful numbers are and the order in which 
they are usually displayed. 

The lowest and highest values are self-explanatory. The median, which we dis- 
cussed earlier, is the number such that half of the values are at or above it and half 
are at or below it. If there is an odd number of values in the data set, the median is 
simply the middle value in the ordered list. If there is an even number of values, the 
median is the average of the middle two values. For example, the median of the list 
70, 75, 85, 86, 87 is 85 because it is the middle value. If the list had an additional 
value of 90 in it, the median would be 85.5, the average of the middle two numbers, 
85 and 86. Make sure you find the middle of the ordered list of values. 
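The odd and even cases can be written out as a short Python sketch. The function name `median_of` is our own; note that it sorts the values first, which is the step most easily forgotten.

```python
def median_of(values):
    """Median: the middle value, or the average of the middle two, of the ordered list."""
    ordered = sorted(values)          # always work from the ordered list
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the middle two

print(median_of([70, 75, 85, 86, 87]))       # five values
print(median_of([70, 75, 85, 86, 87, 90]))   # six values
```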

The median can be found quickly from a stemplot, especially if the leaves have 
been ordered. Using Figure 7.3, convince yourself that the median of the family in- 
come data is the 26th value (51 = 25 + 1 + 25) from either end, which is the low- 
est of the $61,000 values. Consulting Table 7.1, we can see that the actual value is 
$61,036, the value for Missouri. 


Figure 7.8
The five-number summary display

            Median
Lower quartile   Upper quartile
    Lowest          Highest


The quartiles are simply the medians of the two halves of the ordered list. The 
lower quartile—because it’s halfway into the first half—is one quarter of the way 
from the bottom. Similarly, the upper quartile is one quarter of the way down from 
the top. Complicated algorithms exist for finding exact quartiles. We can get close 
enough by simply finding the median first, then finding the medians of all the num- 
bers below it and all the numbers above it. For the family income data, the lower 
quartile would be the median of the 25 values below the median of $61,036. Notice 
that this would be the 13th value from the bottom because 25 = 12 + 1 + 12. Count- 
ing from the bottom, the 13th value is the second of the ones in the $56,000s. Con- 
sulting Table 7.1, the value is $56,067 (Arizona). The upper quartile would be the 
median of the upper 25 values, which is the highest of the values in the $66,000s. 
Consulting Table 7.1, we see that it is $66,507 (Illinois). This tells us that three- 
fourths of the states have median family incomes at or below that for Illinois, 
$66,507. 

The five-number summary for the family income data is thus: 


$61,036 
$56,067 $66,507 
$46,596 $82,879 
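The counting used to reach these five numbers can be checked with a short Python sketch. It applies the shortcut described in this section, taking the quartiles as the medians of the values below and above the median; statistical software may use a more elaborate quartile rule and give slightly different answers. The function name is our own.

```python
from statistics import median

# The 51 median family incomes from Table 7.1, in table order.
incomes = [54594, 71395, 56067, 47838, 63761, 67634, 82517, 73301, 61799,
           56824, 59497, 66014, 51098, 66507, 63573, 61656, 61686, 54319,
           51234, 58425, 82879, 80247, 68337, 72635, 46810, 61036,
           48078, 60626, 59283, 72606, 80577, 46596, 66498, 56500, 55138,
           64282, 53949, 58737, 66130, 70446, 59212, 59718, 56052, 56606,
           59035, 62938, 69616, 65997, 49470, 65441, 58541]

def five_number_summary(values):
    """Return (lowest, lower quartile, median, upper quartile, highest)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values below the median position
    upper = ordered[half + n % 2:]     # values above it (skip the median itself if n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

print(five_number_summary(incomes))
```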


These five numbers provide a useful summary of the entire set of 51 numbers. We 
can get some idea of the middle, the spread, and whether or not the values are 
clumped at one end or the other. The gap between the first quartile and the median 
($4969) isn’t much different from the gap between the median and the third quartile 
($5471), so the values in the middle are fairly symmetric. The gaps between the
extremes and the quartiles are larger than those between the quartiles and the median,
indicating that the values are more tightly clumped in the mid-range than at the ends.
Because a slightly larger gap exists between the third quartile of $66,507 and the 
high of $82,879 than between the low of $46,596 and the first quartile of $56,067, 
we know that the values are probably more clumped at the lower end and more 
spread out at the upper end. 

Note that in using stemplots to find five-number summaries we won’t always be 
able to consult the full set of data values. Remember that we dropped the last three 
digits on the family incomes when we created the stemplot. If we had used the stem- 
plot only, the family income values in the five-number summary (in thousands) 
would have been $46, $56, $61, $66 and $82. All of the conclusions we made in the 
previous paragraph would still be obvious. In fact, they may be more obvious, be- 
cause the arithmetic to find the gaps would be much simpler. Truncated values from 
the stemplot are generally close enough to give us the picture we need. 
7.4 Boxplots


A visually appealing and useful way to present a five-number summary is through a 
boxplot, sometimes called a box and whisker plot. This simple picture also allows 
easy comparison of the center and spread of data collected for two or more groups. 


EXAMPLE 5  How Much Do Statistics Students Sleep?

During spring semester 1998, 190 students in a statistics class at a large university were 
asked to answer a series of questions in class one day, including how many hours they 
had slept the night before (a Tuesday night). A five-number summary for the reported 
number of hours of sleep is 


7
6 8
3 16


Two individuals reported that they slept 16 hours; the maximum for the remaining 188
students was 12 hours.


Creating a Boxplot 


The boxplot for the hours of sleep is presented in Figure 7.9, and illustrates how a 
boxplot is constructed. Here are the steps: 


1. Draw a horizontal or vertical line and label it with values from the lowest to the 
highest values in the data. For the example in Figure 7.9, a horizontal line is used 
and the labeled values range from 3 to 16 hours. 


2. Draw a rectangle, or box, with the ends of the box at the lower and upper quar- 
tiles. In Figure 7.9, the ends of the box are at 6 and 8 hours. 


3. Draw a line in the box at the value of the median. In Figure 7.9 the median is at 
7 hours. 


4. Compute the width of the box. This distance is called the interquartile range be- 
cause it’s the distance between the lower and upper quartiles. It’s abbreviated as 
“IQR.” For the sleep data, the IQR is 2 hours. 


5. Compute 1.5 times the IQR. For the sleep data, this is 1.5 × 2 = 3 hours. Define
an outlier to be any value that is more than this distance from the closest end of
the box. For the sleep data, the ends of the box are 6 and 8, so any value below
(6 − 3) = 3, or above (8 + 3) = 11, is an outlier.

6. Draw a line or “whisker” at each end of the box that extends from the ends of the 
box to the farthest data value that isn’t an outlier. If there are no outliers, these 
will be the minimum and maximum values. In Figure 7.9, the whisker on the left 
extends to the minimum value of 3 hours but the whisker on the right stops at 11 
hours. 


7. Draw asterisks to indicate data values that are beyond the whiskers, and are thus 
considered to be outliers. In Figure 7.9 we see that there are two outliers, at 12 
hours and 16 hours. 
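Steps 4 through 7 amount to a small calculation, sketched here in Python. The quartiles of 6 and 8 hours come from the sleep example, but the short list of sleep values is hypothetical, invented only to exercise the rule; the full class data are not reproduced in this chapter. The function names are our own.

```python
def boxplot_fences(lower_q, upper_q):
    """Cutoffs beyond which a value counts as an outlier (steps 4 and 5)."""
    iqr = upper_q - lower_q            # interquartile range: the width of the box
    reach = 1.5 * iqr
    return lower_q - reach, upper_q + reach

def whiskers_and_outliers(values, lower_q, upper_q):
    """Whisker ends (step 6) and the values drawn as asterisks (step 7)."""
    low, high = boxplot_fences(lower_q, upper_q)
    inside = [v for v in values if low <= v <= high]
    outliers = sorted(v for v in values if v < low or v > high)
    return (min(inside), max(inside)), outliers

# Hypothetical hours of sleep, for illustration only.
sample_hours = [3, 5, 6, 6, 7, 7, 7, 8, 8, 9, 11, 12, 16, 16]

print(boxplot_fences(6, 8))
print(whiskers_and_outliers(sample_hours, 6, 8))
```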


Figure 7.9
Boxplot for hours of sleep
Horizontal axis: hours of sleep, from 3 to 16

If all you have is the information contained in a five-number summary, you can
draw a skeletal boxplot instead. The only change is that the whiskers don’t stop
until they reach the minimum and maximum, and thus outliers are not specifically
identified. You can still determine if there are any outliers at each end by noting whether
the whiskers extend more than 1.5 × IQR. If so, you know that the minimum or
maximum value is an outlier, but you don’t know if there are any other, less extreme
outliers.
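The check described for a skeletal boxplot needs nothing more than the five numbers themselves. Here is a minimal Python sketch of it, applied to the sleep summary; the function name and the dictionary keys are illustrative choices.

```python
def extremes_are_outliers(lowest, lower_q, median, upper_q, highest):
    """From a five-number summary alone, flag whether each extreme is an outlier."""
    reach = 1.5 * (upper_q - lower_q)      # 1.5 times the IQR
    return {
        "lowest_is_outlier": lowest < lower_q - reach,
        "highest_is_outlier": highest > upper_q + reach,
    }

# Five-number summary for hours of sleep: 3, 6, 7, 8, 16.
print(extremes_are_outliers(3, 6, 7, 8, 16))
```

The median is accepted as an argument only so that the whole summary can be passed in order; it plays no part in the outlier rule.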


Interpreting Boxplots 


Notice that boxplots essentially divide the data into fourths. The lowest fourth of the 
data values is contained in the range of values below the start of the box, the next 
fourth is contained in the first part of the box (between the lower quartile and the me- 
dian), the next fourth is in the upper part of the box, and the final fourth is between 
the box and the upper end of the picture. Outliers are easily identified. Notice that 
we are now making the definition of an outlier explicit. 


An outlier is defined to be any value that is more than 1.5 × IQR beyond the
closest quartile.


In the boxplot in Figure 7.9, we can see that one-fourth of the students slept be- 
tween 3 and 6 hours the previous night, one-fourth slept between 6 and 7 hours, one- 
fourth slept between 7 and 8 hours, and the final fourth slept between 8 and 16 hours. 
We can thus immediately see that the data are skewed to the right because the final 
fourth covers an 8-hour period, whereas the lowest fourth covers only a 3-hour 
period. 

As the next example illustrates, boxplots are particularly useful for comparing 
two or more groups on the same measurement. Although almost the same informa- 
tion is contained in five-number summaries, the visual display makes similarities 
and differences much more obvious. 


EXAMPLE 6  Who Are Those Crazy Drivers?


The survey taken in the statistics class in Example 5 also included the question “What's
the fastest you have ever driven a car? ____ mph.” The boxplots in Figure 7.10
illustrate the comparison of the responses for males and females. Here are the
corresponding five-number summaries. (There are only 189 students because one didn’t
answer this question.)


Figure 7.10
Boxplots for fastest ever driven a car
Axis label: Sex


Males (87 Students)     Females (102 Students)
       110                      89
    95     120              80     95
    55     150              30    130


Some features are more immediately obvious in the boxplots than in the five-number 
summaries. For instance, the lower quartile for the men is equal to the upper quartile 
for the women. In other words, 75% of the men have driven 95 mph or faster, but only 
25% of the women have done so. Except for a few outliers (120 and 130), all of the 
women’s maximum driving speeds are close to or below the median for the men.
Notice how useful the boxplots are for comparing the maximum driving speeds for the
sexes.


7.5 Traditional Measures: Mean, Variance, 
and Standard Deviation 


The five-number summary has come into use relatively recently. Traditionally, only 
two numbers have been used to describe a set of numbers: the mean, representing 
the center, and the standard deviation, representing the spread or variability in the 
values. Sometimes the variance is given instead of the standard deviation. The stan- 
dard deviation is simply the square root of the variance, so once you have one you 
can easily compute the other. 

The mean and standard deviation are most useful for symmetric sets of data with 
no outliers. However, they are very commonly quoted, so it is important to under- 
stand what they represent, including their uses and their limitations. 




The Mean and When to Use It 


As we discussed earlier, the mean is the numerical average of a set of numbers. In 
other words, we add up the values and divide by the number of values. The mean can 
be distorted by one or more outliers and is thus most useful when there are no extreme 
values in the data. For example, suppose you are a student taking four classes, and the 
number of students in each is, respectively, 20, 25, 35, and 200. What is your typical 
class size? Notice that the median is 30 students. The mean, however, is 280/4 or 70 
students. The mean is severely affected by the one large class size of 200 students. 
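The class-size example is easy to verify. This brief Python sketch, using the standard library, shows the two measures side by side.

```python
from statistics import mean, median

class_sizes = [20, 25, 35, 200]

typical_by_mean = mean(class_sizes)       # 280 / 4, pulled up by the class of 200
typical_by_median = median(class_sizes)   # average of the middle two, 25 and 35

print(typical_by_mean, typical_by_median)
```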

As another example, refer to Figure 7.7 from Example 4, which displays hours 
per week students reportedly exercise. The majority of students exercised 10 hours 
or less, and the median is only 3 hours. But because there were a few very high val- 
ues, the mean amount is 4.5 hours a week. It would be misleading to say that stu- 
dents exercise an average of 4.5 hours a week. In this case, the median is a better 
measure of the center of the data. 

Data involving incomes or prices of things like houses and cars often are skewed 
to the right with some large outliers. They are unlikely to have extreme outliers at 
the lower end because monetary values can’t go below 0. Because the mean can be 
distorted by the high outliers, data involving incomes or prices are usually summa- 
rized using the median. For example, the median price of a house in a given area, in- 
stead of the mean price, is routinely quoted in the newspaper. That’s because one 
house that sold for several million dollars would substantially distort the mean but 
would have little effect on the median. This is evident in Thought Question 3, where 
it is reported that the median price in a certain area was $136,900 but the average 
price, the mean, was $161,447. 

The mean is most useful for symmetric data sets with no outliers. In such cases, 
the mean and median should be about equal. As an example, notice that the British 
male heights in Figure 7.5 fit that description. The mean height is 1732.5 millimeters 
(about 68.25 inches) and the median height is 1725 millimeters (about 68 inches). 


The Standard Deviation and Variance 


It is not easy to compute the standard deviation of a set of numbers, but most cal- 
culators and computer programs such as Excel now handle that task for you. It is 
more important to know how to interpret the standard deviation, which is a useful 
measure of how spread out the numbers are. Consider the following two sets of num- 
bers, both with a mean of 100: 


Numbers                      Mean    Standard Deviation
100, 100, 100, 100, 100      100     0
90, 90, 100, 110, 110        100     10


The first set of numbers has no spread or variability to it at all. It has a standard
deviation of 0. The second set has some spread to it; on average, the numbers are about
10 points away from the mean, except for the number that is exactly at the mean.
That set has a standard deviation of 10.


Computing the Standard Deviation
Here are the steps necessary to compute the standard deviation:


1. Find the mean.

2. Find the deviation of each value from the mean: value − mean.

3. Square the deviations.

4. Sum the squared deviations.

5. Divide the sum by (the number of values) − 1, resulting in the variance.

6. Take the square root of the variance. The result is the standard deviation.


Let’s try this for the set of values 90, 90, 100, 110, 110.


1. The mean is 100.

2. The deviations are −10, −10, 0, 10, 10.

3. The squared deviations are 100, 100, 0, 100, 100.

4. The sum of the squared deviations is 400.

5. The (number of values) − 1 = 5 − 1 = 4, so the variance is 400/4 = 100.

6. The standard deviation is the square root of 100, or 10.


Although it may seem more logical in step 5 to divide by the number of values, 
rather than by the number of values minus 1, there is a technical reason for sub- 
tracting 1. The reason is beyond the level of this discussion but concerns statistical 
bias, as discussed in Chapter 3. 
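
For readers who would rather let a computer do the arithmetic, the six steps can be sketched in Python. This is only an illustration: the function name `standard_deviation` is made up for this sketch, and the last two lines compare it with Python's built-in `statistics` module, which offers both the "divide by n - 1" version (`stdev`) and the "divide by n" version (`pstdev`).

```python
import math
import statistics


def standard_deviation(values):
    """Sample standard deviation, following the six steps in the text."""
    n = len(values)
    mean = sum(values) / n                     # Step 1: find the mean
    deviations = [v - mean for v in values]    # Step 2: value - mean
    squared = [d ** 2 for d in deviations]     # Step 3: square the deviations
    total = sum(squared)                       # Step 4: sum the squared deviations
    variance = total / (n - 1)                 # Step 5: divide by (number of values) - 1
    return math.sqrt(variance)                 # Step 6: square root of the variance


values = [90, 90, 100, 110, 110]
print(standard_deviation(values))    # 10.0, matching the worked example
print(statistics.stdev(values))      # library version, also divides by n - 1
print(statistics.pstdev(values))     # divides by n instead: about 8.94
```

Dividing by the number of values rather than by one less gives a slightly smaller answer, which is the difference the paragraph above alludes to.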

The easiest interpretation is to recognize that the standard deviation is roughly 
the average distance of the observed values from their mean. Where the data have a 
bell shape, the standard deviation is quite useful indeed. For example, the Stanford- 
Binet IQ test is designed to have a mean of 100 and a standard deviation of 16. If we 
were to produce a histogram of IQs for a large group representative of the whole 
population, we would find it to be approximately bell-shaped. Its center would be at 
100. If we were to determine how far each person’s IQ fell from 100, we would find 
an average distance, on one side or the other, of about 16 points. (In the next chap- 
ter, we will see how to use the standard deviation of 16 in a more useful way.) For 
shapes other than bell shapes, the standard deviation is useful as an intermediate tool 
for more advanced statistical procedures; it is not very useful on its own, however. 


7.6 Caution: Being Average Isn’t Normal 


By now, you should realize that it takes more than just an average value to describe
a set of measurements. Yet, it is a common mistake to confuse “average” with “normal.”
For instance, if a young boy is tall for his age, people might say something like
“He’s taller than normal for a three-year-old.” What they mean is that he’s
taller than the average height of three-year-old boys. In fact, there is quite a range of
possible heights for three-year-old boys, and as we will learn in Chapter 8, any
height within a few standard deviations of the mean is quite “normal.” Be careful
about confusing “average” and “normal” in your everyday speech.

Equating “normal” with average is particularly common in the reporting of weather
data, where news stories often confuse the two. When reporting rainfall data, this
confusion leads to stories about drought and flood years when in fact the rainfall for the
year is well within a “normal” range. If you pay attention, you will notice this mistake
being made in almost all news reports about the weather.


EXAMPLE 7

How Much Hotter Than Normal Is Normal?


It’s true that the beginning of October, 2001 was hot in Sacramento, California. But how
much hotter than “normal” was it? According to the Sacramento Bee:


October came in like a dragon Monday, hitting 101 degrees in Sacramento by
late afternoon. That temperature tied the record high for Oct. 1 set in 1980--and
was 17 degrees higher than normal for the date. (Korber, 2001)


The article was accompanied by a drawing of a thermometer showing that the “Normal
High” for the day was 84 degrees. This is the basis for the statement that the high of
101 degrees was 17 degrees higher than normal. But the high temperature for October
1 is quite variable. October is the time of year when the weather is changing from summer
to fall, and it’s quite natural for the high temperature to be in the 70s, 80s, or 90s.
While 101 was a record high, it was not “17 degrees higher than normal” if “normal”
includes the range of possibilities likely to occur on that date.


CASE STUDY 7.1

Detecting Exam Cheating with a Histogram
Source: Boland and Proschan (Summer 1991), pp. 10-14. 


It was summer 1984, and a class of 88 students at a university in Florida was taking 
a 40-question multiple-choice exam. The proctor happened to notice that one stu- 
dent, whom we will call C, was probably copying answers from a student nearby, 
whom we will call A. Student C was accused of cheating, and the case was brought 
before the university’s supreme court. 

At the trial, evidence was introduced showing that of the 16 questions missed by 
both A and C, both had made the same wrong guess on 13 of them. The prosecution 
argued that a match that close by chance alone was very unlikely, and student C was 
found guilty of academic dishonesty. The case was challenged, however, partly be- 
cause in calculating the odds of such a strong match, the prosecution had used an un- 
reasonable assumption. They assumed that any of the four wrong answers on a 
missed question would be equally likely to be chosen. Common sense, as well as 
data from the rest of the class, made it clear that certain wrong answers were more 
attractive choices than others. 

A second trial was held, and this time the prosecution used a more reasonable 
statistical approach. The prosecution created a measurement for each student in the 
class except A (the one from whom C allegedly copied), resulting in 87 data values. 
For each student, the prosecution simply counted how many of his or her 40 answers
matched the answers on A’s paper. The results are shown in the histogram in Figure 7.11.
Student C is coded as a C, and each asterisk represents one other student.
Student C is an obvious outlier in an otherwise bell-shaped picture. You can see that
it would be quite unusual for that particular student to match A’s answers so well
without some explanation other than chance.

[Figure 7.11: histogram of the number of answers matching A’s paper, one symbol per student]

Unfortunately, the jury managed to forget that the proctor observed Student C 
looking at Student A’s paper. The defense used this oversight to convince them that, 
based only on the histogram, A could have been copying from C. The guilty verdict 
was overturned, despite the compelling statistical picture and evidence. 


For Those Who Like Formulas 


The Data


n = number of observations

$x_i$ = the ith observation, i = 1, 2, ..., n


The Mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$


The Variance

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$


The Computational Formula for the Variance (easier to compute directly
with a calculator)

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$


The Standard Deviation


Use either formula to find $s^2$, then take the square root to get the standard deviation $s$.
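
As a quick check that the two variance formulas agree, here is a minimal Python sketch. The function names `variance_definitional` and `variance_computational` are invented for this illustration; the data are the values 90, 90, 100, 110, 110 used earlier in the chapter.

```python
import math


def variance_definitional(xs):
    """Variance as the sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_computational(xs):
    """Variance from the sum of x and the sum of x squared, divided by n - 1."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x_squared = sum(x * x for x in xs)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


data = [90, 90, 100, 110, 110]
print(variance_definitional(data))              # 100.0
print(variance_computational(data))             # 100.0
print(math.sqrt(variance_definitional(data)))   # 10.0, the standard deviation
```

The computational version needs only two running totals, which is why it is convenient on a calculator. On a computer, rounding error can make it less accurate when the values are very large relative to their spread.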


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. At the beginning of this chapter, the following exam scores were listed and a 
stemplot for them was shown in Figure 7.1: 75, 95, 60, 93, 85, 84, 76, 92, 62, 
83, 80, 90, 64, 75, 79, 32, 78, 64, 98, 73, 88, 61, 82, 68, 79, 78, 80, 55. 


a. Create a stemplot for the test scores using each 10s value twice instead of 
once on the stem. 


b. Compare the stemplot created in part a with the one in Figure 7.1. Are any 
features of the data apparent in the new stemplot that were not apparent in 
Figure 7.1? Explain. 


2. Refer to the test scores in Exercise 1. 
a. Create a five-number summary. 
b. Create a boxplot. 
3. Create a histogram for the test scores in Exercise 1. Comment on the shape. 


*4. Give an example for which the median would be more useful than the mean as 
a measure of center. 


5. Give an example of a set of five numbers with a standard deviation of 0. 


6. Give an example of a set of more than five numbers that has a five-number
summary of

        40
     30    70
     10    80


7. All the information contained in the five-number summary for a data set is re- 
quired for constructing a boxplot. What additional information is required? 


8. Find the mean and standard deviation of the following set of numbers: 10, 20, 
25, 30, 40. 


*9. Refer to the pulse rate data displayed in the stemplots in Figure 7.2.

*a. Find the median.
*b. Create a five-number summary.


10. The data on hours of sleep discussed in Example 5 also included whether each
student was male or female. Here are the separate five-number summaries for
“hours of sleep” for the two sexes:


Males Females


7 7

(remaining values of the two summaries are illegible)


a. Two males reported sleeping 16 hours and one reported sleeping 12 hours.
Using this information and the five-number summaries, draw boxplots that
allow you to compare the sexes on number of hours slept the previous night.
Use a format similar to Figure 7.10.


b. Based on the boxplots in part a, describe the similarities and differences
between the sexes for number of hours slept the previous night.


11. Refer to the data on median family income in Table 7.1; a five-number summary
is given in Section 7.3, page 133.


a. What is the value of the range?
b. What is the value of the interquartile range?


c. What values would be outliers, using the definition of an outlier on page
135? Determine if there are any outliers, and if so, which values are outliers.


d. Construct a boxplot for this data set.


e. Discuss which picture is more useful for this data set: the boxplot from part
d, or the histogram in Figure 7.4.


12. In each of the following cases, would the mean or the median probably be
higher, or would they be about equal?


a. Salaries in a company employing 100 factory workers and 2 highly paid
executives.


b. Ages at which residents of a suburban city die, including everything from
infant deaths to the most elderly.


c. Prices of all new cars sold in 1 month in a large city.
d. Heights of all 7-year-old children in a large city.
e. Shoe sizes of adult women.


*13. Suppose an advertisement reported that the mean weight loss after using a
certain exercise machine for 2 months was 10 pounds. You investigate further and
discover that the median weight loss was 3 pounds.


*a. Explain whether it is most likely that the weight losses were skewed to the
right, skewed to the left, or symmetric.


*b. As a consumer trying to decide whether to buy this exercise machine, would
it have been more useful for the company to give you the mean or the
median? Explain.


14. Construct an example and draw a histogram for a measurement that you think
would be bell-shaped.


*15. Construct an example and draw a histogram for a measurement that you think
would be skewed to the right.


16. Construct an example and draw a histogram for a measurement that you think
would be bimodal.


17. Give an example of a measurement for which the mode would be more useful
than the median or the mean as an indicator of the “typical” value.


18. Explain the following statement in words that someone with no training in
statistics would understand: The heights of adult males in the United States are
bell-shaped, with a mean of about 70 inches and a standard deviation of about
3 inches.


19. Suppose a set of test scores is approximately bell-shaped, with a mean of 70 and
a range of 50. Approximately, what would the minimum and maximum test
scores be?


20. Three types of pictures were presented in this chapter: stemplots, histograms,
and boxplots. Explain the features of a data set for which


a. Stemplots are most useful

b. Histograms are most useful

c. Boxplots are most useful


*21. Would outliers more heavily influence the range or the quartiles? Explain.


22. What is the variance for the Stanford-Binet IQ test?


23. Give one advantage a stemplot has over a histogram and one advantage a
histogram has over a stemplot.


24. Find a set of data of interest to you, such as rents from a newspaper or test scores
from a class, with at least 12 numbers. Include the data with your answer.


a. Create a five-number summary of the data.
b. Create a boxplot of the data.


c. Describe the data in a paragraph that would be useful to someone with no
training in statistics.


*25. Which set of data is more likely to have a bimodal shape: daily New York City
temperatures at noon for the summer months or daily New York City
temperatures at noon for an entire year? Explain.


26. Suppose you had a choice of two professors for a class in which your grade was
very important. They both assign scores on a percentage scale (0 to 100). You
can have access to three summary measures of the last 200 scores each
professor assigned. Of the summary measures discussed in this chapter, which three
would you choose? Why?
27. Draw a boxplot illustrating a data set with each of the following features:


a. Skewed to the right with no outliers.
b. Bell-shaped with the exception of one outlier at the upper end.


c. Values uniformly spread across the range of the data.


*28. The students surveyed for the data in Example 4 were also asked “How many
alcoholic beverages do you consume in a typical week?” Five-number
summaries for males’ and females’ responses are


Males Females


2 0


0 55 0 17.5


a. Draw side-by-side skeletal boxplots for the data.


b. Are the values within each set skewed to the right, bell-shaped, or skewed to
the left? Explain how you know.


*c. In each case, would the mean be higher, lower, or about the same as the
median? Explain how you know.


29. Refer to the previous exercise. Students were also asked if they typically sit in
the front, back, or middle of the classroom. Here are the responses to the
question about alcohol consumption for the students who responded that they
typically sit in the back of the classroom:


Males (N = 22): 0, 0, 0, 0, 0, 0, 0, 1, 3, 3, 4, 5, 10, 10, 10, 14, 15, 15,
20, 30, 45, 55
Females (N = 14): 0, 0, 0, 0, 0, 1, 2, 2, 4, 4, 10, 12, 15, 17.5


a. Create a five-number summary for the males and compare it to the one for
all of the males in the class, shown in the previous exercise. What does this
say about the relationship between where one sits in the classroom and
drinking alcohol?


b. Repeat part a for the females.
c. Create a stemplot for the males and comment on its shape.
d. Create a stemplot for the females and comment on its shape.


*30. Refer to the previous exercise. Find the mean and median number of drinks for
males. Which one is a better representation of how much a typical male who sits
in the back of the room drinks? Explain.


31. Refer to the data in Exercise 29. Using the definition of outliers on page 135,
identify which value(s) are outliers in each of the two sets of values (Males and
Females).


32. The Winters [CA] Express on October 30, 2003, reported that the seasonal
rainfall (since July 1) for the year was 0.36 inches, and that the “Normal to October
28” rainfall is 1.14 inches. Does this mean that the area received abnormally low
rainfall in the period from July 1 to October 28, 2003? Explain.


33. According to the National Weather Service, there is about a 10% chance that
total annual rainfall for Sacramento, CA will be less than 11.1 inches and a 20%
chance that it will be less than 13.5 inches. At the upper end, there is about a
10% chance that it will exceed 29.8 inches and a 20% chance that it will exceed
25.7 inches. The average amount is about 19 inches. In the 2001 year (July 1,
2000 to June 30, 2001) the area received about 14.5 inches of rain. Write two news
reports of this fact, one that conveys an accurate comparison to other years and
one that does not.


34. Refer to Original Source 5 on the CD, “Distractions in everyday driving.”
Notice that on page 86 of the report, when the responses are summarized for the
quantitative data in Question 8, only the mean is provided. But for Questions 7
and 9 the mean and median are provided. Why do you think the median is
provided in addition to the mean for these two questions?


*35. Refer to Original Source 20 on the CD, “Organophosphorus pesticide exposure
of urban and suburban preschool children with organic and conventional diets.”
In Table 4 on page 381, information is presented for estimated dose levels of
various pesticides for children who eat organic versus conventional produce.
Find the data for malathion. Assume the minimum exposure for both groups is
0. Otherwise, all of the information you need for a five-number summary is
provided. (Note that the percentiles are the values with that percent of the data at or
below the value. For instance, the median is the 50th percentile.)


*a. Create a five-number summary for the malathion exposure values for each
group (organic and conventional).
b. Construct side-by-side skeletal boxplots for the malathion exposure values
for the two groups. Write a few sentences comparing them.


c. Notice that in each case the mean is higher than the median. Explain why this
is the case.


Mini-Projects 


1. Find a set of data that has meaning for you. Some potential sources are the
Internet, the sports pages, and the classified ads. Using the methods given in this
chapter, summarize and display the data in whatever ways are most useful. Give
a written description of interesting features of the data.


2. Measure your pulse rate 25 times over the next few days, but don’t take more
than one measurement in any 10-minute period. Record any unusual events re- 
lated to the measurements, such as if one was taken during exercise or one was 
taken immediately upon awakening. Create a stemplot and a five-number sum- 
mary of your measurements. Give a written assessment of your pulse rate based 
on the data. 
CHAPTER 8


Bell-Shaped Curves and Other Shapes


Thought Questions 


1. The heights of adult women in the United States follow, at least approximately, a
bell-shaped curve. What do you think this means?


2. What does it mean to say that a man’s weight is in the 30th percentile for all adult
males?


3. A “standardized score” is simply the number of standard deviations an individual
score falls above or below the mean for the whole group. (Values above the mean
have positive standardized scores, whereas those below the mean have negative
ones.) Male heights have a mean of 70 inches and a standard deviation of 3 inches.
Female heights have a mean of 65 inches and a standard deviation of 2 1/2 inches.
Thus, a man who is 73 inches tall has a standardized score of 1. What is the
standardized score corresponding to your own height?


4. Data sets consisting of physical measurements (heights, weights, lengths of bones,
and so on) for adults of the same species and sex tend to follow a similar pattern.
The pattern is that most individuals are clumped around the average, with numbers
decreasing the farther values are from the average in either direction. Describe what
shape a histogram of such measurements would have.
8.1 Populations, Frequency Curves, and Proportions


Figure 8.1 A normal frequency curve


In Chapter 7, we learned how to draw a picture of a set of data and how to think 
about its shape. In this chapter, we learn how to extend those ideas to pictures and 
shapes for populations of measurements. For example, in Figure 7.5 we illustrated 
that, based on a sample of 199 men, heights of adult British males are reasonably 
bell-shaped. Because the men were a representative sample, the picture for all of the 
millions of British men is probably similar. But even if we could measure them all, 
it would be difficult to construct a histogram with so much data. What is the best way 
to represent the shape of a large population of measurements? 


Frequency Curves 


The most common type of picture for a population is a smooth frequency curve. 
Rather than drawing lots of tiny rectangles, the picture is drawn as if the tops of the 
rectangles were connected with a smooth curve. Figure 8.1 illustrates a frequency 
curve for the population of British male heights. Notice that the picture is similar to 
the histogram in Figure 7.5, except that the curve is smooth and the heights have 
been converted to inches. 

Notice that the vertical scale is simply labeled “height of curve.” This height is 
determined by sizing the curve so that the area under the entire curve is 1, for rea- 
sons that will become clear in the next few pages. Unlike with a histogram, the 
height of the curve cannot be interpreted as a proportion or frequency, but is chosen 
simply to satisfy the rule that the entire area under the curve is 1. 

The bell shape illustrated in Figure 8.1 is so common that if a population has this 
shape, the measurements are said to follow a normal distribution. Equivalently 
they are said to follow a bell-shaped curve, a normal curve, or a Gaussian curve. 
This last name comes from the name of Karl Friedrich Gauss (1777-1855), who was 
one of the first mathematicians to investigate the shape. 
Figure 8.2 A nonnormal frequency curve (horizontal axis: claims in thousands of dollars)


Not all frequency curves are bell-shaped. Figure 8.2 shows a likely frequency 
curve for the population of dollar amounts of car insurance damage claims for a 
Midwestern city in the United States, based on data from 187 claims in that city in 
the early 1990s (Ott and Mendenhall, 1994). Notice that the curve is skewed to the 
right. Most of the claims were below $12,000, but occasionally there was an ex- 
tremely high claim. For the remainder of this chapter, we focus on bell-shaped 
curves. 


Proportions 


Frequency curves are quite useful for determining what proportion or percentage of 
the population of measurements falls into a certain range. If we wanted to find out 
what percentage of the data fell into any particular range with a stemplot, we would 
count the number of leaves that were in that range and divide by the total. If we 
wanted to find the percentage in a certain range using a histogram, we would simply 
add up the heights of the rectangles for that range, assuming we had used propor- 
tions instead of counts for the heights. If not, we would add up the counts for that 
range and divide by the total number in the sample. 

What if we have a frequency curve instead of a stemplot or histogram? Fre- 
quency curves are, by definition, drawn to make it easy to represent the proportion 
of the population falling into a certain range. Recall that they are drawn so the entire 
area underneath the curve is 1, or 100%. Therefore, to figure out what percentage or 
proportion of the population falls into a certain range, all you have to do is figure out 
how much of the area is situated over that range. For example, in Figure 8.1, half of 
the area is in the range above the mean height of 68.25 inches. In other words, about 
half of all British men are 68.25 inches or taller. 

Although it is easy to visualize what proportion of a population falls into a certain
range using a frequency curve, it is not as easy to compute that proportion. For
anything but very simple cases, the computation to find the required area involves
the use of calculus. However, because bell-shaped curves are so common, tables
have been prepared in which the work has already been done (see, for example,
Table 8.1 at the end of this chapter), and many calculators and computer applications
such as Excel will compute these proportions.
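
To make the idea of "area under the curve" concrete, here is a small Python sketch that adds up thin rectangles under a bell-shaped curve. The mean of 68.25 inches comes from the text; the standard deviation of 2.5 inches is a made-up value used only for illustration, and the helper name `area_between` is likewise invented for this sketch.

```python
from statistics import NormalDist

# Mean taken from the text; the standard deviation is an ASSUMED value,
# chosen only so that there is a concrete curve to work with.
MEAN = 68.25
ASSUMED_SD = 2.5
heights = NormalDist(mu=MEAN, sigma=ASSUMED_SD)


def area_between(dist, low, high, steps=10_000):
    """Approximate the area under the frequency curve between low and high
    by summing thin rectangles (midpoint rule)."""
    width = (high - low) / steps
    return sum(dist.pdf(low + (i + 0.5) * width) for i in range(steps)) * width


# Nearly all of the curve lies within 8 standard deviations of the mean.
total = area_between(heights, MEAN - 8 * ASSUMED_SD, MEAN + 8 * ASSUMED_SD)
upper_half = 1 - heights.cdf(MEAN)   # proportion above the mean, via the library

print(round(total, 4))   # 1.0: the entire area under the curve
print(upper_half)        # 0.5: half the population is at or above the mean
```

The `cdf` method does the calculus for you: it returns the proportion of the population at or below a given value, which is the same quantity a printed table provides.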


8.2 The Pervasiveness of Normal Curves 


Nature provides numerous examples of populations of measurements that, at least 
approximately, follow a normal curve. If you were to create a picture of the shape of 
almost any physical measurement within a homogeneous population, you would 
probably get the familiar bell shape. In addition, many psychological attributes, such 
as IQ, are normally distributed. Many standard academic tests, such as the Scholas- 
tic Assessment Test (SAT), if given to a large group, will result in normally distrib- 
uted scores. 

The fact that so many different kinds of measurements all follow approximately 
the same shape should not be surprising. The majority of people are somewhere 
close to average on any attribute, and the farther away you move from the average, 
either above or below, the fewer people will have those more extreme values for their 
measurements. 

Sometimes a set of data is distorted to make it fit a normal curve. That’s what 
happens when a professor “grades on a bell-shaped curve.” Rather than assign the 
grades students have actually earned, the professor distorts them to make them fit 
into a normal curve, with a certain percentage of A’s, B’s, and so on. In other words, 
grades are assigned as if most students were average, with a few good ones at the top 
and a few bad ones at the bottom. Unfortunately, this procedure has a tendency to ar- 
tificially spread out clumps of students who are at the top or bottom of the scale, so 
that students whose original grades were very close together may receive different 
letter grades. 


8.3 Percentiles and Standardized Scores 


Percentiles 


Have you ever wondered what percentage of the population of your sex is taller than 
you are, or what percentage of the population has a lower IQ than you do? Your per- 
centile in a population represents the position of your measurement in comparison 
with everyone else’s. It gives the percentage of the population that falls below you. 
If you are in the 50th percentile, it means that exactly half of the population falls be- 
low you. If you are in the 98th percentile, 98% of the population falls below you and 
only 2% is above you. 
Your percentile is easy to find if the population of values has an approximate bell
shape and if you have just three pieces of information. All you need to know are your 
own value and the mean and standard deviation for the population. Although there 
are obviously an unlimited number of potential bell-shaped curves, depending on the 
magnitude of the particular measurements, each one is completely determined once 
you know its mean and standard deviation. In addition, each one can be “standard- 
ized” in such a way that the same table can be used to find percentiles for any of 
them. 


Standardized Scores 


Suppose you knew your IQ was 116, as measured by the Stanford-Binet IQ test. 
Scores from that test have a normal distribution with a mean of 100 and a standard 
deviation of 16. Therefore, your IQ is exactly 1 standard deviation above the mean 
of 100. In this case, we would say you have a standardized score of 1. In general, a 
standardized score simply represents the number of standard deviations the ob- 
served value or score falls from the mean. A positive standardized score indicates an 
observed value above the mean, whereas a negative standardized score indicates a 
value below the mean. Someone with an IQ of 84 would have a standardized score 
of —1 because he or she would be exactly 1 standard deviation below the mean. 
Sometimes the abbreviated term standard score is used instead of “standardized 
score.” The letter z is often used to represent a standardized score, so another syn- 
onym is z-score. 

Once you know the standardized score for an observed value, all you need to find 
the percentile is the appropriate table, one that gives percentiles for a normal distri- 
bution with a mean of 0 and a standard deviation of 1. A normal curve with a mean 
of 0 and a standard deviation of 1 is called a standard normal curve. It is the curve 
that results when any normal curve is converted to standardized scores. In other 
words, the standardized scores resulting from any normal curve will have a mean of 
0 and a standard deviation of 1 and will retain the bell shape. 

Table 8.1, presented at the end of this chapter, gives percentiles for standardized 
scores. For example, with an IQ of 116 and a standardized score of +1, you would 
be at the 84th percentile. In other words, your IQ would be higher than that of 84% 
of the population. If we are told the percentile for a score but not the value itself, we 
can also work backward from the table to find the value. Let’s review the steps nec- 
essary to find a percentile from an observed value, and vice versa. 


To find the percentile from an observed value:


1. Find the standardized score: (observed value - mean)/s.d., where s.d. =
standard deviation. Don’t forget to keep the plus or minus sign.


2. Look up the percentile in Table 8.1 (page 157).


To find an observed value from a percentile:


1. Look up the percentile in Table 8.1 and find the corresponding
standardized score.


2. Compute the observed value: mean + (standardized score)(s.d.), where
s.d. = standard deviation.
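
The two procedures above can be sketched in Python, with the standard library's `NormalDist` standing in for the printed table. The function names `percentile_from_value` and `value_from_percentile` are invented for this illustration, and both assume the population is bell-shaped.

```python
from statistics import NormalDist

# The standard normal curve: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def percentile_from_value(value, mean, sd):
    """Step 1: standardize the value. Step 2: look up its percentile."""
    z = (value - mean) / sd
    return 100 * STANDARD_NORMAL.cdf(z)


def value_from_percentile(percentile, mean, sd):
    """Step 1: find the standardized score. Step 2: convert it back."""
    z = STANDARD_NORMAL.inv_cdf(percentile / 100)
    return mean + z * sd


# An IQ of 116 on a test with mean 100 and standard deviation 16:
p = percentile_from_value(116, 100, 16)
print(round(p))                                     # 84, the 84th percentile

# Working backward from that percentile recovers the original score.
print(round(value_from_percentile(p, 100, 16), 1))  # 116.0
```

Because a printed table lists only whole percentiles and rounded standardized scores, answers read from it can differ slightly from computed ones.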


EXAMPLE 1

Tragically Low IQ


In the Edinburgh newspaper the Scotsman on March 8, 1994, a headline read, “Jury
urges mercy for mother who killed baby” (p. 2). The baby had died from improper care.
One of the issues in the case was that “the mother... had an IQ lower than 98 percent
of the population, the jury had heard.” From this information, let's compute the mother's
IQ. If it was lower than 98% of the population, it was higher than only 2%, so she was
in the 2nd percentile. From Table 8.1, we see that her standardized score was -2.05, or
2.05 standard deviations below the mean of 100. We can now compute her IQ:


observed value = mean + (standardized score)(s.d.)
observed value = 100 + (-2.05)(16)
observed value = 100 + (-32.8) = 100 - 32.8
observed value = 67.2


Thus, her IQ was about 67. The jury was convinced that her IQ was, tragically, too low
to expect her to be a competent mother.


EXAMPLE 2

Calibrating Your GRE Score


The Graduate Record Examination (GRE) is a test taken by college students who intend
to pursue a graduate degree in the United States. For all college seniors and graduates
who took the exam between October 1, 1989, and September 30, 1992, the mean for
the verbal ability portion of the exam was about 497 and the standard deviation was
115 (Educational Testing Service, 1993). If you had received a score of 650 on that GRE
exam, what percentile would you be in, assuming the scores were bell-shaped? We can
compute your percentile by first computing your standardized score:


standardized score = (observed value - mean)/(s.d.)
standardized score = (650 - 497)/115
standardized score = 153/115 = 1.33


From Table 8.1, we see that a standardized score of 1.33 is between the 90th percentile
score of 1.28 and the 91st percentile score of 1.34. In other words, your score was
higher than about 90% of the population. Figure 8.3 illustrates the GRE score of 650
for the population of GRE scores and the corresponding standardized score of 1.3 for
the standard normal curve. Notice the similarity of the two pictures.
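
The GRE calculation can be checked with a few lines of Python, using the standard library's `NormalDist` in place of Table 8.1. This is a sketch under the same assumption made in the example, namely that the scores are bell-shaped; the variable names are chosen only for this illustration.

```python
from statistics import NormalDist

mean, sd, score = 497, 115, 650

# Standardized score: (observed value - mean) / s.d.
z = (score - mean) / sd

# Percentile: percentage of the standard normal curve at or below z.
percentile = 100 * NormalDist().cdf(z)

print(round(z, 2))           # 1.33
print(round(percentile, 1))  # about 90.8, between the 90th and 91st percentiles
```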


lan Stewart (17 September 1994, p. 14) reported on a problem posed to a statistician 
by a British company called Molegon, whose business was to remove unwanted moles 


Figure 8.3 

The 90th percentile for 
GRE scores and 
standardized scores 
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Standardized scores 


from gardens. The company kept records indicating that the population of weights of 
moles in its region was approximately normal, with a mean of 150 grams and standard 
deviation of 56 grams. The European Union announced that starting in 1995, only moles 
weighing between 68 grams and 211 grams can be legally caught. Molegon wanted to 
know what percentage of all moles could be legally caught. 

To solve this problem, we need to know what percentage of all moles weigh be- 
tween 68 grams and 211 grams. We need to find two standardized scores, one for each 
end of the interval, and then find the percentage of the curve that lies between them: 


standardized score for 68 grams = (68 - 150)/56 = -1.46

standardized score for 211 grams = (211 - 150)/56 = 1.09


From Table 8.1, we see that about 86% of all moles weigh 211 grams or less. But we
also see that about 7% are below the legal limit of 68 grams. Therefore, about
86% - 7% = 79% are within the legal limits. Of the remaining 21%, 14% are too big to be
legal and 7% are too small. Figure 8.4 illustrates this situation.
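
The two-step subtraction can also be checked by computer. The sketch below is an illustration under the same assumptions as the example (weights approximately normal with mean 150 grams and standard deviation 56 grams); it uses `NormalDist` from Python's standard `statistics` module, and the variable names are hypothetical.

```python
from statistics import NormalDist

# Population of mole weights as described in the example (grams).
weights = NormalDist(mu=150, sigma=56)

below_upper = weights.cdf(211)   # proportion weighing 211 grams or less
below_lower = weights.cdf(68)    # proportion weighing 68 grams or less

# Proportion between the two legal limits.
legal = below_upper - below_lower

print(f"at or below 211 g: {below_upper:.2f}")
print(f"at or below 68 g:  {below_lower:.2f}")
print(f"legal to catch:    {legal:.2f}")
```

Because the computer does not round the standardized scores to two decimals first, its answers may differ from the table-based ones in the second decimal place.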


Figure 8.4


8.4 Scores and Familiar Intervals


Any educated consumer of statistics should know a few facts about normal curves. 
First, as mentioned already, a synonym for a standardized score is a z-score. Thus, if 
you are told that your z-score on an exam is 1.5, it means that your score is 1.5 stan- 
dard deviations above the mean. You can use that information to find your approxi- 
mate percentile in the class, assuming the scores are approximately bell-shaped. 
Second, some easy-to-remember intervals can give you a picture of where val- 
ues on any normal curve will fall. This information is known as the Empirical Rule. 


Empirical Rule 

For any normal curve, approximately 
68% of the values fall within 1 standard deviation of the mean in either 
direction 


95% of the values fall within 2 standard deviations of the mean in either 
direction 


99.7% of the values fall within 3 standard deviations of the mean in ei- 
ther direction 


A measurement would be an extreme outlier if it fell more than 3 standard deviations 
above or below the mean. You can see why the standard deviation is such an impor- 
tant measure. If you know that a set of measurements is approximately bell-shaped, 
and you know the mean and standard deviation, then even without a table like Table 
8.1, you can say a fair amount about the magnitude of the values. 
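
The three percentages in the Empirical Rule can be checked directly against the standard normal curve. The following is a minimal sketch, assuming Python 3.8 or later and its standard `statistics.NormalDist`; the name `within` is made up for the illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Proportion of the curve lying within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(f"within {k} s.d. of the mean: {proportion:.4f}")
```

Rounded, the three proportions are 0.68, 0.95, and 0.997, which is why the rule is stated with the word "approximately."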


Figure 8.5
The Empirical Rule for heights of adult women


For example, because adult women in the United States have a mean height of 
about 65 inches (5 feet 5 inches) with a standard deviation of about 2.5 inches, and 
heights are bell-shaped, we know that approximately 


■ 68% of adult women in the United States are between 62.5 inches and 67.5 inches
■ 95% of adult women in the United States are between 60 inches and 70 inches
■ 99.7% of adult women in the United States are between 57.5 inches and 72.5 inches


Figure 8.5 illustrates the Empirical Rule for the heights of adult women in the United 
States. 

The mean height for adult males in the United States is about 70 inches and the 
standard deviation is about 3 inches. You can easily compute the ranges into which 
68%, 95%, and almost all men’s heights should fall. 
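
One way to organize that computation is sketched below. The function name `empirical_rule_ranges` is hypothetical, and the sketch simply applies the rule (mean plus or minus 1, 2, and 3 standard deviations) to the means and standard deviations quoted above.

```python
def empirical_rule_ranges(mean, sd):
    """Return the intervals expected to hold about 68%, 95%, and 99.7% of values."""
    return {
        label: (mean - k * sd, mean + k * sd)
        for label, k in (("68%", 1), ("95%", 2), ("99.7%", 3))
    }

women = empirical_rule_ranges(65, 2.5)  # heights in inches
men = empirical_rule_ranges(70, 3)

for label, (low, high) in men.items():
    print(f"about {label} of men: {low} to {high} inches")
```

The ranges for women reproduce the three bulleted intervals given earlier, which is a quick check that the function is doing what the rule says.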


Using Computers to Find Normal Curve Proportions 


There are computer programs and Web sites that will find the proportion of a 
normal curve that falls below a specified value, above a value, and between 
two values. For example, here are two useful Excel functions: 


NORMSDIST(value) provides the proportion of the standard normal 
curve below the value. Example: NORMSDIST(1) = .8413, which 
rounds to .84, shown in Table 8.1 for z = 1. 


NORMDIST(value,mean,s.d.,1) provides the proportion of a normal 
curve with the specified mean and standard deviation (s.d.) that lies be- 
low the value given. (If the last number in parentheses is 0 instead of 1, 
it gives you the height of the curve at that value, which isn’t much use to 
you. The “1” tells it that you want the proportion below the value.) Ex- 
ample: NORMDIST(67.5,65,2.5,1) = .8413, representing the proportion 
of adult women with heights of 67.5 inches or less. 
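
If you do not have Excel, the same two lookups can be imitated in Python. The sketch below is an assumption-laden illustration: the function names `normsdist` and `normdist` are made-up stand-ins that mimic the Excel names, and the work is done by `NormalDist` from Python's standard `statistics` module (Python 3.8 or later).

```python
from statistics import NormalDist

def normsdist(value):
    """Proportion of the standard normal curve below value."""
    return NormalDist().cdf(value)

def normdist(value, mean, sd, cumulative=True):
    """Proportion below value for a normal curve with the given mean and s.d.

    With cumulative=False it returns the height of the curve at value instead.
    """
    dist = NormalDist(mean, sd)
    return dist.cdf(value) if cumulative else dist.pdf(value)

print(round(normsdist(1), 4))             # 0.8413
print(round(normdist(67.5, 65, 2.5), 4))  # 0.8413
```

Both calls give the same answer because a height of 67.5 inches is exactly 1 standard deviation above the mean of 65 inches.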




For Those Who Like Formulas 


Notation for a Population 


The lowercase Greek letter “mu” (μ) represents the population mean.

The lowercase Greek letter “sigma” (σ) represents the population standard
deviation.

Therefore, the population variance is represented by σ².

A normal distribution with a mean of μ and variance of σ² is denoted by
N(μ, σ²).

For example, the standard normal distribution is denoted by N(0, 1).


Standardized Score z for an Observed Value x

z = (x - μ)/σ


Observed Value x for a Standardized Score z

x = μ + zσ


Empirical Rule
If a population of values is N(μ, σ²), then approximately:
68% of values fall within the interval μ ± σ
95% of values fall within the interval μ ± 2σ
99.7% of values fall within the interval μ ± 3σ
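
The two conversion formulas undo each other, which a short sketch can make concrete. The function names `z_from_x` and `x_from_z` are hypothetical, and the numbers are the women's height figures used earlier in the chapter (mean 65 inches, standard deviation 2.5 inches).

```python
def z_from_x(x, mu, sigma):
    """Standardized score for an observed value x."""
    return (x - mu) / sigma

def x_from_z(z, mu, sigma):
    """Observed value corresponding to a standardized score z."""
    return mu + z * sigma

z = z_from_x(67.5, 65, 2.5)  # 1.0: one standard deviation above the mean
x = x_from_z(z, 65, 2.5)     # 67.5: back to the original height

print(z, x)
```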


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. Using Table 8.1, a computer, or a calculator, determine the percentage of the 
population falling below each of the following standard scores: 


a. -1.00
b. 1.96 
c. 0.84 


*2. Using Table 8.1, a computer, or a calculator, determine the percentage of the 
population falling above each of the following standard scores: 


*a. 1.28
*b. -0.25
*c. 2.33


Table 8.1 Proportions and Percentiles for Standard Normal Scores

Standard  Proportion                  Standard  Proportion
Score, z  Below z       Percentile    Score, z  Below z       Percentile
-6.00     0.000000001   0.0000001     0.03      0.51          51
-5.20     0.0000001     0.00001       0.05      0.52          52
-4.26     0.00001       0.001         0.08      0.53          53
-3.00     0.0013        0.13          0.10      0.54          54
-2.576    0.005         0.50          0.13      0.55          55
-2.33     0.01          1             0.15      0.56          56
-2.05     0.02          2             0.18      0.57          57
-1.96     0.025         2.5           0.20      0.58          58
-1.88     0.03          3             0.23      0.59          59
-1.75     0.04          4             0.25      0.60          60
-1.64     0.05          5             0.28      0.61          61
-1.55     0.06          6             0.31      0.62          62
-1.48     0.07          7             0.33      0.63          63
-1.41     0.08          8             0.36      0.64          64
-1.34     0.09          9             0.39      0.65          65
-1.28     0.10          10            0.41      0.66          66
-1.23     0.11          11            0.44      0.67          67
-1.17     0.12          12            0.47      0.68          68
-1.13     0.13          13            0.50      0.69          69
-1.08     0.14          14            0.52      0.70          70
-1.04     0.15          15            0.55      0.71          71
-1.00     0.16          16            0.58      0.72          72
-0.95     0.17          17            0.61      0.73          73
-0.92     0.18          18            0.64      0.74          74
-0.88     0.19          19            0.67      0.75          75
-0.84     0.20          20            0.71      0.76          76
-0.81     0.21          21            0.74      0.77          77
-0.77     0.22          22            0.77      0.78          78
-0.74     0.23          23            0.81      0.79          79
-0.71     0.24          24            0.84      0.80          80
-0.67     0.25          25            0.88      0.81          81
-0.64     0.26          26            0.92      0.82          82
-0.61     0.27          27            0.95      0.83          83
-0.58     0.28          28            1.00      0.84          84
-0.55     0.29          29            1.04      0.85          85
-0.52     0.30          30            1.08      0.86          86
-0.50     0.31          31            1.13      0.87          87
-0.47     0.32          32            1.17      0.88          88
-0.44     0.33          33            1.23      0.89          89
-0.41     0.34          34            1.28      0.90          90
-0.39     0.35          35            1.34      0.91          91
-0.36     0.36          36            1.41      0.92          92
-0.33     0.37          37            1.48      0.93          93
-0.31     0.38          38            1.55      0.94          94
-0.28     0.39          39            1.64      0.95          95
-0.25     0.40          40            1.75      0.96          96
-0.23     0.41          41            1.88      0.97          97
-0.20     0.42          42            1.96      0.975         97.5
-0.18     0.43          43            2.05      0.98          98
-0.15     0.44          44            2.33      0.99          99
-0.13     0.45          45            2.576     0.995         99.5
-0.10     0.46          46            3.00      0.9987        99.87
-0.08     0.47          47            3.75      0.9999        99.99
-0.05     0.48          48            4.26      0.99999       99.999
-0.03     0.49          49            5.20      0.9999999     99.99999
0.00      0.50          50            6.00      0.999999999   99.9999999


3. Using Table 8.1, a computer, or a calculator, determine the standard score that
has the following percentage of the population below it:


a. 25%
b. 75%
c. 45%
d. 98%


*4. Using Table 8.1, a computer, or a calculator, determine the standard score that
has the following percentage of the population above it:


*a. 2%
b. 50%
*c. 75%
d. 10%


5. Using Table 8.1, a computer, or a calculator, determine the percentage of the
population falling between the two standard scores given:


a. -1.00 and 1.00
b. -1.28 and 1.75
c. 0.0 and 1.00


6. The 84th percentile for the Stanford-Binet IQ test is 116. (Recall that the mean
is 100 and the standard deviation is 16.)


a. Verify that this is true by computing the standardized score and using
Table 8.1.

b. Draw pictures of the original and standardized scores to illustrate this
situation, similar to the pictures in Figure 8.3.


7. Draw a picture of a bell-shaped curve with a mean value of 100 and a standard
deviation of 10. Mark the mean and the intervals derived from the Empirical
Rule in the appropriate places on the horizontal axis. You do not have to mark
the vertical axis. Use Figure 8.5 as a guide.


*8. Find the percentile for the observed value in the following situations:


*a. GRE score of 450 (mean = 497, s.d. = 115).

b. Stanford-Binet IQ score of 92 (mean = 100, s.d. = 16).

c. Woman’s height of 68 inches (mean = 65 inches, s.d. = 2.5 inches).


9. Mensa is an organization that allows people to join only if their IQs are in the
top 2% of the population.


a. What is the lowest Stanford-Binet IQ you could have and still be eligible to
join Mensa? (Remember that the mean is 100 and the standard deviation
is 16.)

b. Mensa also allows members to qualify on the basis of certain standard tests.
If you were to try to qualify on the basis of the GRE exam, what score would
you need on the exam? (Remember that the mean is 497 and the standard
deviation is 115.)

*10. Every time you have your cholesterol measured, the measurement may be
slightly different due to random fluctuations and measurement error. Suppose
that for you the population of possible cholesterol measurements if you are
healthy has a mean of 190 and a standard deviation of 10. Further, suppose you
know you should get concerned if your measurement ever gets up to the 97th
percentile. What level of cholesterol does that represent?


11. Use Table 8.1 to verify that the Empirical Rule is true. You may need to round
off the values slightly.


*12. Recall from Chapter 7 that the interquartile range covers the middle 50% of the
data. For a bell-shaped population:


*a. The interquartile range covers what range of standardized scores? In other
words, what are the standardized scores for the lower and upper quartiles?
(Hint: Draw a standard normal curve and locate the 25th and 75th percentiles
using Table 8.1.)


b. How many standard deviations are covered by the interquartile range?


c. The whiskers on a boxplot can extend a total of 2 interquartile ranges on
either side of the median, which for a bell-shaped population is equal to the
mean. (They can extend 1.5 IQR outside of the box but the distance between
the median/mean and end of the box is an additional 0.5 IQR.) Beyond that
range, data values are considered to be outliers. In other words, for
bell-shaped populations, data values are outliers if they are more than 2 IQRs
away from the mean. At what percentiles (at the upper and lower ends) are
data values considered to be outliers for bell-shaped populations?


13. Give an example of a population of measurements that you do not think has a
normal curve, and draw its frequency curve.


14. A graduate school program in English will admit only students with GRE
verbal ability scores in the top 30%. What is the lowest GRE score it will accept?
(Recall the mean is 497 and the standard deviation is 115.)


*15. Recall that for Stanford-Binet IQ scores the mean is 100 and the standard
deviation is 16.


*a. Use the Empirical Rule to specify the ranges into which 68%, 95%, and
99.7% of Stanford-Binet IQ scores fall.


b. Draw a picture similar to Figure 8.5 for Stanford-Binet scores, illustrating
the ranges from part a.


16. For every 100 births in the United States, the number of boys follows,
approximately, a normal curve with a mean of 51 boys and standard deviation of 5 boys.
If the next 100 births in your local hospital resulted in 36 boys (and thus 64
girls), would that be unusual? Explain.


17. Suppose a candidate for public office is favored by only 48% of the voters. If a
sample survey randomly selects 2500 voters, the percentage in the sample who
favor the candidate can be thought of as a measurement from a normal curve
with a mean of 48% and a standard deviation of 1%. Based on this information,
how often would such a survey show that 50% or more of the sample favored
the candidate?


*18. Suppose you record how long it takes you to get to work or school over many
months and discover that the times are approximately bell-shaped with a mean
of 15 minutes and a standard deviation of 2 minutes. How much time should you
allow to get there to make sure you are on time 90% of the time?


19. Assuming heights for each sex are bell-shaped, with means of 70 inches for men
and 65 inches for women, and with standard deviations of 3 inches for men and
2.5 inches for women, what proportion of your sex is shorter than you are? (Be
sure to mention your sex and height in your answer!)


*20. According to Chance magazine ([1993], 6, no. 3, p. 5), the mean healthy adult
temperature is around 98.2° Fahrenheit, not the previously assumed value of
98.6°. Suppose the standard deviation is 0.6 degree and the population of
healthy temperatures is bell-shaped.


*a. What proportion of the population have temperatures at or below the
presumed norm of 98.6°?


*b. Would it be accurate to say that the normal healthy adult temperature is 98.2°
Fahrenheit? Explain.


21. Remember from Chapter 7 that the range for a data set is found as the
difference between the maximum and minimum values. Explain why it makes sense
that for a bell-shaped data set of a few hundred values, the range should be about
4 to 6 standard deviations.


22. Suppose that you were told that scores on an exam in a large class you are
taking ranged from 50 to 100 and that they were approximately bell-shaped.


a. Estimate the mean for the exam scores.


b. Refer to the result about the relationship between the range and standard
deviation in the previous exercise. Estimate the standard deviation for the exam
scores, using that result and the information in this problem.


c. Suppose your score on the exam was 80. Explain why it is reasonable to
assume that your standardized score is about 0.5.


d. Based on the standardized score in part c, about what proportion of the class
scored higher than you did on the exam?


23. Recall that GRE scores are approximately bell-shaped with a mean of 497 and
standard deviation of 115. The minimum and maximum possible scores on the
GRE exam are 200 and 800, respectively.


a. What is the range for GRE scores?


b. Refer to the result about the relationship between the range and standard
deviation in Exercise 21. Does the result make sense for GRE scores? Explain.


*24. Over many years, rainfall totals for Sacramento, CA in January ranged from a
low of about 0.05 inch to a high of about 19.5 inches. The median was about 3.1
inches. Based on this information, explain how you can tell that the distribution
of rainfall values in Sacramento in January cannot be bell-shaped.


25. Math SAT scores for students admitted to a university are bell-shaped with a
mean of 520 and a standard deviation of 60.


a. Draw a picture of these SAT scores, indicating the cutoff points for the
middle 68%, 95%, and 99.7% of the scores.


b. A student had a math SAT score of 490. Find the standardized score for this
student and draw where her score would fall on your picture in part a.


References 


Educational Testing Service. (1993). GRE 1993-94 guide. Princeton, NJ: Educational Testing 
Service. 

Ott, R. L., and W. Mendenhall. (1994). Understanding statistics, 6th ed. Belmont, CA: 
Duxbury Press. 

Stewart, Ian. (17 September 1994). Statistical modelling. New Scientist: Inside Science 74, 
p. 14. 




Plots, Graphs, and Pictures 


Thought Questions 


1. You have seen pie charts and bar graphs and should have some rudimentary idea of
how to construct them. Suppose you have been keeping track of your living expenses
and find that you spend 50% of your money on rent, 25% on food, and 25% on
other expenses. Draw a pie chart and a bar graph to depict this information. Discuss
which is more visually appealing and useful.


2. Here is an example of a plot that has some problems. Give two reasons why this is
not a good plot.

[Plot: Domestic Water Production, 1968-1992. Vertical axis: Production. Horizontal axis: fiscal year, 70-71 through 90-91.]


3. Suppose you had a set of data representing two measurement variables (namely,
height and weight) for each of 100 people. How could you put that information
into a plot, graph, or picture that illustrated the relationship between the two
measurements for each person?


4. Suppose you own a company that produces candy bars and you want to display two
graphs. One graph is for customers and shows the price of a candy bar for each of
the past 10 years. The other graph is for stockholders and shows the amount the
company was worth for each of the past 10 years. You decide to adjust the dollar
amounts in one graph for inflation but to use the actual dollar amounts in the other
graph. If you were trying to present the most favorable story in each case, which
graph would be adjusted for inflation? Explain.


9.1 Well-Designed Statistical Pictures


There are many ways to present data in pictures. The most common are plots and 
graphs, but sometimes a unique picture is used to fit a particular situation. The pur- 
pose of a plot, graph, or picture of data is to give you a visual summary that is more 
informative than simply looking at a collection of numbers. Done well, a picture can 
quickly convey a message that would take you longer to find if you had to study the 
data on your own. Done poorly, a picture can mislead all but the most observant of 
readers. Here are some basic characteristics that all plots, graphs, and pictures 
should exhibit: 


1. The data should stand out clearly from the background. 


2. There should be clear labeling that indicates 
a. the title or purpose of the picture. 
b. what each of the axes, bars, pie segments, and so on, denotes. 
c. the scale of each axis, including starting points. 


3. A source should be given for the data. 


4. There should be as little “chart junk”—that is, extraneous material—in the pic- 
ture as possible. 


9.2 Pictures of Categorical Data 


Categorical data are easy to represent with pictures. The most frequent use of such 
data is to determine how the whole divides into categories, and pictures are useful in 
expressing that information. Let’s look at three common types of pictures for cate- 
gorical data and their uses. 


Pie Charts 


Pie charts are useful when only one categorical variable is measured. Pie charts 
show what percentage of the whole falls into each category. They are simple to un- 
derstand, and they convey information about the relative size of groups more read- 
ily than a table. Figure 9.1 shows a pie chart that represents the percentage of 
Caucasian American children who have various hair colors. 


Bar Graphs 


Bar graphs also show percentages or frequencies in various categories, but they can 
be used to represent two or three categorical variables simultaneously. One categor- 
ical variable is used to label the horizontal axis. Within each of the categories along 
that axis, a bar is drawn to represent each category of the second variable.
Frequencies or percentages are shown on the vertical axis. A third variable can be included
if it has only two categories, by using percentages on the vertical axis: one
category is shown, and the other is implied by the fact that the total must be 100%.


Figure 9.1
Pie chart of hair colors of Caucasian American children
Source: Krantz, 1992, p. 188.


Figure 9.2
Percentage of males and females 16 and over in the labor force
Source: Based on data from U.S. Dept. of Labor, Bureau of Labor Statistics, Current
Population Survey.

For example, Figure 9.2 illustrates employment trends for men and women 
across decades. The year in which the information was collected is one categorical 
variable, represented by the horizontal axis. In each year, people were categorized 
according to two additional variables: whether they were in the labor force and 
whether they were male or female. Separate bars are drawn for males and females, 
and the percentage in the labor force determines the heights of the bars. It is implicit 
that the remainder were not in the labor force. Respondents were part of the Bureau 
of Labor Statistics’ Current Population Survey, the large monthly survey used to de- 
termine unemployment rates. 

The decision about which variable occupies which position should be made to
better convey visually the purpose for the graph. The purpose of the graph in
Figure 9.2 is to illustrate that the percentage of women in the labor force has increased
since 1950, whereas the percentage of men has decreased slightly, resulting in the
two percentages coming closer together. The gap in 1950 was 53 percentage points,
but by 2000 it was less than 15 percentage points, as is illustrated by the graph.

Bar graphs are not always as visually appealing as pie charts, but they are much
more versatile. They can also be used to represent actual frequencies instead of
percentages and to represent proportions that are not required to sum to 100%.


Figure 9.3
Two pictograms showing percentages of Ph.D.s earned by women
Source: Alper, 16 April 1993, p. 409.


Pictograms 


A pictogram is like a bar graph except that it uses pictures related to the topic of the 
graph. Figure 9.3 shows a pictogram illustrating the proportion of Ph.D.s earned by 
women in three fields—psychology (58%), biology (37%), and mathematics 
(18%)—as reported in Science (16 April, 1993, 260, p. 409). Notice that in place of 
bars, the graph uses pictures of diplomas. 

It is easy to be misled by pictograms. The pictogram on the left shows the diplo- 
mas using realistic dimensions. However, it is misleading because the eye tends to 
focus on the area of the diploma rather than just its height. The heights of the three 
diplomas reach the correct proportions, with heights of 58%, 37%, and 18%, so the 
height of the one for psychology Ph.D.s is just over three times the height of the one 
for math Ph.D.s. However, in keeping the proportions realistic, the area of the 
diploma for psychology is about nine times the area of the one for math, leading the 
eye to inflate the difference. 

The pictogram on the right is drawn by keeping the width of the diplomas the 
same for each field. The picture is visually more accurate, but it is less appealing be- 
cause the diplomas are consequently quite distorted in appearance. When you see a 
pictogram, be careful to interpret the information correctly and not to let your eye 
mislead you. 
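
The arithmetic behind the left-hand pictogram can be sketched as follows. The sketch assumes that each diploma keeps the same shape, so its width grows by the same factor as its height and its area grows by the square of that factor; the variable names are invented for the illustration.

```python
# Percentages of Ph.D.s earned by women, from the text above.
psychology, mathematics = 58, 18

# The heights of the diplomas are drawn in the correct proportion.
height_ratio = psychology / mathematics

# If the width scales along with the height, area scales by the square.
area_ratio = height_ratio ** 2

print(f"height ratio: {height_ratio:.2f}")
print(f"area ratio:   {area_ratio:.2f}")
```

Using the exact percentages, the area ratio comes out near ten; rounding the height ratio to three gives the "about nine times" figure quoted above. Either way, the eye sees a difference roughly three times larger than the data support.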


Figure 9.4
Line graph displaying winning time versus year for men’s 500-meter Olympic speed skating
Source: http://sportsillustrated.cnn.com


9.3 Pictures of Measurement Variables


Measurement variables can be illustrated with graphs in numerous ways. We saw 
two ways to illustrate a single measurement variable in Chapter 7—namely, stem- 
plots and histograms. Graphs are most useful for displaying the relationship between 
two measurement variables or for displaying how a measurement variable changes 
over time. Two common types of displays for measurement variables are illustrated 
in Figures 9.4 and 9.5. 


Line Graphs 


Figure 9.4 is an example of a line graph displayed over time. It shows the winning 
times for the men’s 500-meter speed skating event in the Winter Olympics from 
1924 to 2002. Notice the distinct downward trend, with only a few upturns over the 
years. There was a large drop between 1952 and 1956, followed by a period of rela- 
tive stability. These patterns are much easier to detect with a picture than they would 
be by scanning a list of winning times. 


Scatterplots 


Figure 9.5 is an example of a scatterplot. Scatterplots are useful for displaying the 
relationship between two measurement variables. Each dot on the plot represents one 
individual, unless two or more individuals have the same data, in which case only 
one point is plotted at that location. The plot in Figure 9.5 shows the grade point av- 
erages (GPAs) and verbal scholastic achievement test (SAT) scores for a sample of 
100 students at a university in the northeastern United States. 

Although a scatterplot can be more difficult to read than a line graph, it displays
more information. It shows outliers, as well as the degree of variability that exists for
one variable at each location of the other variable. In Figure 9.5, we can see an
increasing trend toward higher GPAs with higher SAT scores, but we can also still see
substantial variability in GPAs at each level of verbal SAT scores. A scatterplot is
definitely more useful than the raw data. Simply looking at a list of the 100 pairs of
GPAs and SAT scores, we would find it difficult to detect the trend that is so
obvious in the scatterplot.


Figure 9.5
Scatterplot of grade point average versus verbal SAT score
(Vertical axis: grade point average. Horizontal axis: verbal SAT score, 350 to 750.)
Source: Ryan, Joiner, and Ryan, 1985, pp. 309-312.


9.4 Difficulties and Disasters in Plots, Graphs, and Pictures


A number of common mistakes appear in plots and graphs that may mislead readers. 
If you are aware of them and watch for them, you will substantially reduce your 
chances of misreading a statistical picture. 


The most common problems in plots, graphs, and pictures are 


1. No labeling on one or more axes
2. Not starting at zero as a way to exaggerate trends
3. Change(s) in labeling on one or more axes
4. Misleading units of measurement
5. Using poor information


Figure 9.6
Example of a graph with no labeling (a) and possible interpretations (b and c)
(a) Actual graph: Domestic Water Production, 1968-1992 (vertical axis labeled “Production”)
(b) Axis in “actual graph” starts at zero
(c) Axis in “actual graph” does not start at zero
Source: Insert in the California Aggie (UC Davis), 30 May 1993.


No Labeling on One or More Axes 


You should always look at the axes in a picture to make sure they are labeled. Fig- 
ure 9.6a gives an example of a plot for which the units were not labeled on the ver- 
tical axis. The plot appeared in a newspaper insert titled, “May 1993: Water 
awareness month.” When there is no information about the units used on one of the 
axes, the plot cannot be interpreted. To see this, consider Figure 9.6b and c, display- 
ing two different scenarios that could have produced the actual graph in Figure 9.6a. 
In Figure 9.6b, the vertical axis starts at zero for the existing plot. In Figure 9.6c, the 
vertical axis for the original plot starts at 30 and stops at 40, so what appears to be a 
large drop in 1979 in the other two graphs is only a minor fluctuation. We do not 
know which of these scenarios is closer to the truth, yet you can see that the two pos- 
sibilities represent substantially different situations. 


Figure 9.7
An example of the change in perception when axes start at zero


Not Starting at Zero


Often, even when the axes are labeled, the scale of one or both of the axes does not 
start at zero, and the reader may not notice that fact. A common ploy is to present an 
increasing or decreasing trend over time on a graph that does not start at zero. As we 
saw for the example in Figure 9.6, what appears to be a substantial change may ac- 
tually represent quite a modest change. Always make it a habit to check the numbers 
on the axes to see where they start. 

Figure 9.7 shows what the line graph of winning times for the Olympic speed 
skating data in Figure 9.4 would have looked like if the vertical axis had started at 
zero. Notice that the drop in winning times over the years does not look nearly as 
dramatic as it did in Figure 9.4. Be very careful about this form of potential decep- 
tion if someone is presenting a graph to display growth in sales of a product, a drop 
in interest rates, and so on. Be sure to look at the labeling, especially on the vertical 
axis. 

Despite this, be aware that for some graphs it makes sense to start the units on 
the axes at values different from zero. A good example is the scatterplot of GPAs ver- 
sus SAT scores in Figure 9.5. It would make no sense to start the horizontal axis 
(SAT scores) at zero because the range of interest is from about 350 to 800. It is the 
responsibility of the reader to notice the units. Never assume a graph starts at zero 
without checking the labeling. 
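
To see how much a truncated axis can exaggerate a change, consider the small sketch below. The two data values (36 and 38) and the axis starting point (30) are hypothetical numbers chosen only for illustration, and the function name `apparent_ratio` is made up.

```python
def apparent_ratio(first, second, axis_start=0.0):
    """Ratio of the drawn heights (second to first) when the axis begins at axis_start."""
    return (second - axis_start) / (first - axis_start)

first, second = 36.0, 38.0  # hypothetical values for two successive years

ratio_from_zero = apparent_ratio(first, second)        # axis starts at zero
ratio_truncated = apparent_ratio(first, second, 30.0)  # axis starts at 30

print(f"axis from 0:  second point drawn {ratio_from_zero:.2f} times as high")
print(f"axis from 30: second point drawn {ratio_truncated:.2f} times as high")
```

With these made-up numbers, an increase of under 6% is drawn as if it were an increase of about one-third once the axis starts at 30.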


Changes in Labeling on One or More Axes 


Figure 9.8 shows an example of a graph where a cursory look would lead one to 
think the vertical axis starts at zero. However, notice the white horizontal bar just 
above the bottom of the graph, in which the vertical bars are broken. That indicates 
a gap in the vertical axis. In fact, you can see that the bottom of the graph actually 
corresponds to about 4.0%. It would have been more informative if the graph had 
simply been labeled as such, without the break.


Figure 9.8
A bar graph with gap in labeling
Source: Davis (CA) Enterprise, 4 March 1994, p. A-7.


Figure 9.9 shows a much more egregious example of changes in labeling. Notice
that the horizontal axis does not maintain consistent distances between years and that 
varying numbers of years are represented by each of the bars. The distance between 
the first and second bars on the left is 8 years, whereas the 5 bars farthest to the right 
each represent a single year. This is an extremely misleading graph. 


Misleading Units of Measurement 


The units shown on a graph can be different from those that the reader would con- 
sider important. For example, Figure 9.10 shows a graph with the heading, “Rising 
Postal Rates.” It accurately represents how the cost of a first-class stamp rose from 
1971 to 1991. However, notice that the fine print at the bottom reads, “In 1971 dol- 
lars, the price of a 32-cent stamp in February 1995 would be 8.4 cents.” A more 
truthful picture would show the changing price of a first-class stamp adjusted for in- 
flation. As the footnote implies, such a graph would show little or no rise in postal 
rates as a function of the worth of a dollar. 
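
The adjustment itself is a single division, as the sketch below suggests. The only figures used are the two quoted in the footnote (32 cents in February 1995, equivalent to 8.4 cents in 1971 dollars); the price ratio is inferred from them, and the function name `to_1971_cents` is hypothetical.

```python
# Figures quoted in the footnote of Figure 9.10.
nominal_1995_cents = 32.0
same_stamp_in_1971_cents = 8.4

# Implied ratio of February 1995 prices to 1971 prices.
price_ratio_1995_to_1971 = nominal_1995_cents / same_stamp_in_1971_cents

def to_1971_cents(nominal_cents, price_ratio):
    """Convert a later price into 1971 cents by dividing by the price ratio."""
    return nominal_cents / price_ratio

print(f"implied price ratio: {price_ratio_1995_to_1971:.2f}")
print(f"32-cent stamp in 1971 cents: "
      f"{to_1971_cents(32.0, price_ratio_1995_to_1971):.1f}")
```

This ratio applies only to February 1995 prices. Redrawing the whole graph would require a separate ratio, taken from a published price index, for each year shown.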


Using Poor Information 


A picture can only be as accurate as the information that was used to design it. All
of the cautions about interpreting the collection of information given in Part 1 of this
book apply to graphs and plots as well. You should always be told the source of
information presented in a picture, and an accompanying article should give you as
much information as necessary to determine the worth of that information.


Figure 9.9
The distance between successive bars keeps changing.
Source: Washington Post graph reprinted in Wainer, 1984.


Figure 9.11
A graph based on poor information
Source: The Independent on Sunday (London), 13 March 1994.

Figure 9.11 shows a graph that appeared in the London newspaper the Indepen- 
dent on Sunday on March 13, 1994. The accompanying article was titled, “Sniffers 
Quit Glue for More Lethal Solvents.” The graph appears to show that very few 
deaths occurred in Britain from solvent abuse before the late 1970s. However, the ac- 
companying article includes the following quote, made by a research fellow at the 
unit where the statistics are kept: “It’s only since we have started collecting accurate 
data since 1982 that we have begun to discover the real scale of the problem” (p. 5). 
In other words, the article indicates that the information used to create the graph is 
not at all accurate until at least 1982. Therefore, the apparent sharp increase in deaths 
linked to solvent abuse around that time period is likely to have been simply a sharp 
increase in deaths reported and classified. 

Don’t forget that a statistical picture isn’t worth much if the data can’t be trusted. 
Once again, you should familiarize yourself to the extent possible with the Seven 
Critical Components listed in Chapter 2 (pp. 18-19). 
9.5 A Checklist for Statistical Pictures


To summarize, here are 10 questions you should ask when you look at a
statistical picture, before you even begin to try to interpret the data displayed.

1. Does the message of interest stand out clearly?
2. Is the purpose or title of the picture evident?
3. Is a source given for the data, either with the picture or in an
accompanying article?
4. Did the information in the picture come from a reliable, believable
source?
5. Is everything clearly labeled, leaving no ambiguity?
6. Do the axes start at zero or not?
7. Do the axes maintain a constant scale?
8. Are there any breaks in the numbers on the axes that may be easy to
miss?
9. For financial data, have the numbers been adjusted for inflation?
10. Is there information cluttering the picture or misleading the eye?


CASE STUDY 9.1

Time to Panic about Illicit Drug Use?


The graph illustrated in Figure 9.12 appeared on the Web site for the
U.S. Department of Justice, Drug Enforcement Administration, in spring 1998
(http://www.usdoj.gov/dea/drugdata/cp-316.htm). The headline over the graph reads
“Emergency Situation among Our Youth.” Look quickly at the graph, and describe what
you see. Did it lead you to believe that almost 80% of 8th-graders used illicit drugs in 1996,
compared with only about 10% in 1992? The graph is constructed so that you might
easily draw that conclusion. Careful reading indicates otherwise, and shows that
crucial information is missing. The graph tells us only that in 1996 the rate of use was
80% higher than, or 1.8 times, what it was in 1991. The actual rate of use is not provided
at all in the graph; it emerges only after searching the remainder of the Web site.
The rate of illicit drug use among 8th-graders in 1991 was about 11%, and
thus, in 1996, it was about 1.8 times that, or about 19.8%. Additional information
elsewhere on the Web site indicates that about 8% of 8th-graders used marijuana in
1991, and thus this was the most common illicit drug used. These are still disturbing
statistics, but not as disturbing as the graph would lead you to believe.
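
The distinction between a percentage increase and an actual rate can be checked with a few lines of arithmetic. This short Python sketch uses only the two numbers given above; the variable names are ours.

```python
rate_1991 = 0.11          # about 11% of 8th-graders in 1991 (from the Web site)
percent_increase = 0.80   # the "80%" in the graph is an increase, not a rate

# An 80% increase means multiplying the starting rate by 1.8.
rate_1996 = rate_1991 * (1 + percent_increase)
print(f"{rate_1996:.1%}")  # 19.8%
```

The bar labeled 80% therefore corresponds to roughly one 8th-grader in five, not four in five.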
Figure 9.12
Emergency situation among our youth: 8th-grade drug use
[Graph title: Percentage Increase in Lifetime Use of Any Illicit Drug among 8th-Graders between 1991 and 1996]
Source: U.S. Dept. of Justice.
Exercises

Asterisked (*) exercises are included in the Solutions at the back of the book.


*1. Give the name of a type of statistical picture that could be used for each of the 
following kinds of data: 


*a. One categorical variable 
*b. One measurement variable 
*c. Two categorical variables 
*d. Two measurement variables 


2. Suppose a real estate company in your area sold 100 houses last month, whereas 
their two major competitors sold 50 houses and 25 houses, respectively. The top 
company wants to display its better record with a pictogram using a simple two- 
dimensional picture of a house. Draw two pictograms displaying this informa- 
tion, one of which is misleading and one of which is not. (The horizontal axis 
should list the three companies and the vertical axis should list the number of 
houses sold.) 


3. One method used to compare authors or to determine authorship on unsigned 
writing is to look at the frequency with which words of different lengths appear 
in a piece of text. For this exercise, you are going to compare your own writing 
with that of the author of this book. 


a. Using the first full paragraph of this chapter (not the Thought Questions), 
create a pie chart with three segments, showing the relative frequency of 
words of 1 to 3 letters, 4 to 5 letters, and 6 or more letters in length. (Do not 
include the numbered list after the paragraph.) 


b. Find a paragraph of your own writing of at least 50 words. Repeat part a of 
this exercise for your own writing. 

c. Display the data in parts a and b of this exercise using a single bar graph that 
includes the information for both writers. 

d. Discuss how your own writing style is similar to or different from that of the 
author of this book, as evidenced by the pictures in parts a to c. 


e. Name one advantage of displaying the information in two pie charts and one 
advantage of displaying the information in a single bar graph. 


4. An article in Science (23 January 1998, 279, p. 487) reported on a “telephone
survey of 2600 parents, students, teachers, employers, and college professors”
in which people were asked the question, “Does a high school diploma mean
that a student has at least learned the basics?” Results were as follows:


       Professors   Employers   Parents   Teachers   Students
Yes       22%          35%        62%       73%        77%
No        76%          63%        32%       26%        22%


a. The article noted that “there seems to be a disconnect between the producers
[parents, teachers, students] and the consumers [professors, employers] of
high school graduates in the United States.” Create a bar graph from this
study that emphasizes this feature of the data.


b. Create a bar graph that deemphasizes the issue raised in part a. 


5. Figure 9.10, which displays rising postal rates, is an example of a graph with
misleading units because the prices are not adjusted for inflation. The graph
actually has another problem as well. Use the checklist in Section 9.5 to determine
the problem; then redraw the graph correctly (but still use the unadjusted
prices). Comment on the difference between Figure 9.10 and your new picture.


6. In its February 24-26, 1995 edition (p. 7), USA Weekend gave statistics on the
changing status of which parent children live with. As noted in the article, the
numbers don’t total 100% because they are drawn from two sources: the U.S.
Census Bureau and America’s Children: Resources from Family, Government,
and the Economy by Donald Hernandez (New York: Russell Sage Foundation,
1995). Using the data shown in Table 9.1, draw a bar graph presenting the
information. Be sure to include all the components of a good statistical picture.


7. Figure 10.4 in Chapter 10 displays the success rate for professional golfers when
putting at various distances. Discuss the figure in the context of the material in
this chapter. Are there ways in which the picture could be improved?
Table 9.1

Kids Live With            1960     1980     1990
Father and mother         80.6%    62.3%    57.7%
Mother only                7.7%    18.0%    21.6%
Father only                1.0%     1.7%     3.1%
Father and stepmother      0.8%     1.1%     0.9%
Mother and stepfather      5.9%     8.4%    10.4%
Neither parent             3.9%     5.8%     4.3%


*8. Table 9.2 indicates the population (in millions) and the number of violent crimes 
(in millions) in the United States from 1982 to 1991, as reported in the World 
Almanac and Book of Facts (1993, p. 948). 


a. Draw two line graphs representing the trend in violent crime over time. Draw 
the first graph to try to convince the reader that the trend is quite ominous. 
Draw the second graph to try to convince the reader that it is not. Make sure 
all of the other features of your graph meet the criteria for a good picture. 


*b. Draw a scatterplot of population versus violent crime, making sure it meets 
all the criteria for a good picture. Comment on the scatterplot. Now explain 
why drawing a line graph of violent crime versus year, as in part a of this
exercise, might be misleading.


c. Rather than using number of violent crimes on the vertical axis, redraw the 
first line graph (from part a) using a measure that adjusts for the increase in 
population. Comment on the differences between the two graphs. 


Table 9.2 U.S. Population and Violent Crime* (in millions)

Year              1982   1983   1985   1986   1987   1988   1989   1990   1991
U.S. population   231    234    239    241    243    246    248    249    252
Violent crime     1.32   1.26   1.33   1.49   1.48   1.57   1.65   1.82   1.91

*Figures for 1984 were not available in the original.


9. Find an example of a statistical picture in a newspaper or magazine or on the In- 
ternet. Answer the 10 questions in Section 9.5 for the picture. In the process of 
answering the questions, explain what (if any) features you think should have 
been added or changed to make it a good picture. Include the picture with your 
answer. 


10. According to the American Medical Association Family Medical Guide (1982, 
p. 422), the distribution of blood types in the United States in the late 1970s was 
as shown in Table 9.3. 


a. Draw a pie chart illustrating the blood-type distribution for white Americans,
ignoring the Rh factor.


b. Draw a statistical picture incorporating all of the information given. 
Table 9.3 Blood Types in the United States in the 1970s

              White Americans      African Americans
Blood Type    Rh+       Rh−        Rh+       Rh−
A             38.8%     7.0%       26.0%     2.0%
B              7.0%     1.0%       17.0%     1.5%
AB             3.0%     0.6%        4.0%     0.4%
O             37.0%     6.0%       45.0%     4.0%


11. Find an example of a statistical picture in a newspaper or magazine that has at
least one of the problems listed in Section 9.4, “Difficulties and Disasters in
Plots, Graphs, and Pictures.” Explain the problem. If you think anything should
have been done differently, explain what and why. Include the picture with your
answer.


*12. Find a graph that does not start at zero. Redraw the picture to start at zero.
Discuss the pros and cons of the two versions.


13. According to an article in The Seattle Times (Meckler, 2003), living organ
donors are most often related to the organ recipient. Table 9.4 gives the
percentages of each type of relationship for all 6613 cases where an organ was
transplanted from a living donor in 2002 in the United States. Create a pie chart
displaying the relationship of the donor to the recipient and write a few
sentences describing the data.


Table 9.4 Living Donor’s Relationship to Organ Transplant Recipient
for All Cases in the United States in 2002

Relationship      Percent of Donors
Sibling           30%
Child             19%
Parent            13%
Spouse            11%
Other relative     8%
Not related       19%


14. Table 9.5 provides the total number of men and women who were employed in
1971, 1981, 1991, and 2001 in the United States.

a. Create a bar graph for the data.


b. Compare the bar graph to the one in Figure 9.2, which presents the percent 
of men and women who were employed. Discuss what can be learned from 
each graph that can’t be learned from the other. 
Table 9.5 Total Number of Men and Women
in the U.S. Labor Force (in millions)

Year    Men Employed    Women Employed
1971    49.4            30.0
1981    57.4            43.0
1991    64.2            53.5
2001    73.2            63.7

Source: Current Population Survey, Bureau of Labor Statistics.


15. Refer to Additional News Story 19 on the CD, a press release from Cornell
University entitled “Puppy love’s dark side: First study of love-sick teens reveals
higher risk of depression, alcohol use and delinquency.” The article includes a
graph labeled “Adjusted change in depression between interviews.” Comment
on the graph.


16. Refer to Figure 1 on page 691 of Original Source 11 on the CD, “Driving
impairment due to sleepiness is exacerbated by low alcohol intake.”

a. What type of picture is in the figure?

b. Write a few sentences explaining what you learn from the picture about lane
drifting episodes under the different conditions.


17. Refer to Figure 2 on page 691 of Original Source 11 on the CD, “Driving
impairment due to sleepiness is exacerbated by low alcohol intake.”

a. What type of picture is in the figure?

b. Write a few sentences explaining what you learn from the picture about
subjective sleepiness ratings under the different conditions.


Mini-Projects 


1. Collect some categorical data on a topic of interest to you and represent it in a
statistical picture. Explain what you have done to make sure the picture is as
useful as possible.


2. Collect two measurement variables on each of at least 10 individuals. Represent
them in a statistical picture. Describe the picture in terms of possible outliers,
variability, and relationship between the two variables.


3. Find some data that represent change over time for a topic of interest to you.
Present a line graph of the data in the best possible format. Explain what you
have done to make sure the picture is as useful as possible.
CHAPTER 10 Relationships Between Measurement Variables


Thought Questions


1. Judging from the scatterplot in Figure 9.5, there is a positive correlation between
verbal SAT score and GPA. For used cars, there is a negative correlation between the age
of the car and the selling price. Explain what it means for two variables to have a
positive correlation or a negative correlation.


2. Suppose you were to make a scatterplot of (adult) sons’ heights versus fathers’
heights by collecting data on both from several of your male friends. You would now
like to predict how tall your nephew will be when he grows up, based on his father’s
height. Could you use your scatterplot to help you make this prediction? Explain.


3. Do you think each of the following pairs of variables would have a positive
correlation, a negative correlation, or no correlation?

a. Calories eaten per day and weight.

b. Calories eaten per day and IQ.

c. Amount of alcohol consumed and accuracy on a manual dexterity test.

d. Number of ministers and number of liquor stores in cities in Pennsylvania.

e. Height of husband and height of wife.


4. An article in the Sacramento Bee (29 May 1998, p. A17) noted, “Americans are just
too fat, researchers say, with 54 percent of all adults heavier than is healthy. If the
trend continues, experts say that within a few generations virtually every U.S. adult
will be overweight.” This prediction is based on “extrapolating,” which assumes the
current rate of increase will continue indefinitely. Is that a reasonable assumption? Do
you agree with the prediction? Explain.
10.1 Statistical Relationships


One of the interesting advances made possible by the use of statistical methods is the 
quantification and potential confirmation of relationships. In the first part of this 
book, we discussed relationships between aspirin and heart attacks, meditation and 
aging, and smoking during pregnancy and child’s IQ, to name just a few. In Chapter 
9, we saw examples of relationships between two variables illustrated with pictures, 
such as the scatterplot of verbal SAT scores and college GPAs. 

Although we have examined many relationships up to this point, we have not 
considered how those relationships could be expressed quantitatively. In this chap- 
ter, we discuss correlation, which measures the strength of a certain type of rela- 
tionship between two measurement variables, and regression, which is a numerical 
method for trying to predict the value of one measurement variable from knowing 
the value of another one. 


Statistical Relationships versus 
Deterministic Relationships 


A statistical relationship differs from a deterministic relationship in that, in the
latter case, if we know the value of one variable, we can determine the value of the
other exactly. For example, the relationship between volume and weight of water is
deterministic. The old saying, “A pint’s a pound the world around,” isn’t quite true,
but the deterministic relationship between volume and weight of water does hold. (A
pint is actually closer to 1.04 pounds.) We can express the relationship by a formula,
and if we know one value, we can solve for the other (weight in pounds = 1.04 ×
volume in pints).


Natural Variability in Statistical Relationships 


In a statistical relationship, natural variability exists in both measurements. For ex- 
ample, we could describe the average relationship between height and weight for 
adult females, but very few women would fit that exact formula. If we knew a 
woman’s height, we could predict the average weight for all women with that same 
height, but we could not predict her weight exactly. Similarly, we can say that, on 
average, taking aspirin every other day reduces one’s chance of having a heart attack, 
but we cannot predict what will happen to one specific individual. 

Statistical relationships are useful for describing what happens to a popula- 
tion, or aggregate. The stronger the relationship, the more useful it is for predict- 
ing what will happen for an individual. When researchers make claims about 
statistical relationships, they are not claiming that the relationship will hold for 
everyone. 
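
The contrast between the two kinds of relationship can be sketched in a few lines of Python. The water formula comes from the text. The height and weight coefficients and the amount of variability are invented purely for illustration and should not be read as real population values.

```python
import random


def water_weight_pounds(pints):
    """Deterministic: the same volume always gives the same weight."""
    return 1.04 * pints


def simulated_weight_pounds(height_inches, rng):
    """Statistical: an average trend plus individual variability.

    The coefficients and the spread are hypothetical, chosen only to
    illustrate the idea.
    """
    average_weight = -200 + 5.5 * height_inches   # hypothetical average relationship
    return average_weight + rng.gauss(0, 20)      # hypothetical natural variability


rng = random.Random(1)

# Same input, same output, every time.
print(water_weight_pounds(2), water_weight_pounds(2))

# Same height, five different simulated women, five different weights.
weights = [simulated_weight_pounds(65, rng) for _ in range(5)]
print([round(w) for w in weights])
```

Knowing the height narrows down the weight only on average. How much it narrows it down is what the strength of the relationship measures.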
10.2 Strength versus Statistical Significance


To find out if a statistical relationship exists between two variables, researchers must 
usually rely on measurements from only a sample of individuals from a larger pop- 
ulation. However, for any particular sample, a relationship may exist even if there is 
no relationship between the two variables in the population. It may be just the “luck 
of the draw” that that particular sample exhibited the relationship. 

For example, suppose an observational study followed for 5 years a sample of 
1000 owners of satellite dishes and a sample of 1000 nonowners and found that 4 of 
the satellite dish owners developed brain cancer, whereas only 2 of the nonowners 
did. Could the researcher legitimately claim that the rate of cancer among all satel- 
lite dish owners is twice that among nonowners? You would probably not be per- 
suaded that the observed relationship was indicative of a problem in the larger 
population. The numbers are simply too small to be convincing. 
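
One rough way to see why the numbers are too small is to ask how often a split at least this lopsided would arise by luck alone. The sketch below makes a simplifying assumption that is ours, not the text's: if dish ownership were unrelated to brain cancer, each of the 6 cases would be equally likely to come from either group of 1000.

```python
from math import comb

total_cases = 6    # 4 among owners + 2 among nonowners
owner_cases = 4

# Chance that 4 or more of the 6 cases land in the owner group when each
# case is equally likely to fall in either group (simplifying assumption).
p_at_least_as_lopsided = sum(
    comb(total_cases, k) for k in range(owner_cases, total_cases + 1)
) / 2 ** total_cases

print(p_at_least_as_lopsided)  # 0.34375
```

Under that assumption, a result this lopsided in favor of owners would turn up about one time in three even with no relationship at all in the population.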


Defining Statistical Significance 


To overcome this problem, statisticians try to determine whether an observed rela- 
tionship in a sample is statistically significant. To determine this, we ask what the 
chances are that a relationship that strong or stronger would have been observed in the 
sample if there really were nothing going on in the population. If those chances are 
small, we declare that the relationship is statistically significant and was not just a 
fluke. To be convincing, an observed relationship must also be statistically significant. 


Most researchers are willing to declare that a relationship is statistically sig- 
nificant if the chances of observing the relationship in the sample when actu- 
ally nothing is going on in the population are less than 5%. In other words, a 
relationship is considered to be statistically significant if that relationship is 
stronger than 95% of the relationships we would expect to see just by chance. 


Of course, this reasoning carries with it the implication that of all the relationships 
that do occur by chance alone, 5% of them will erroneously earn the title of statisti- 
cal significance. However, this is the price we pay for not being able to measure the 
entire population—while still being able to determine that statistically significant re- 
lationships do exist. We will learn how to assess statistical significance in Chapters 
13, 22, and 23. 
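
The 5% figure can be illustrated with a small simulation. This is only a sketch under assumptions of our own: strength is measured by the correlation coefficient introduced in Section 10.3, each sample has 20 individuals, and both variables are drawn independently, so any relationship seen in a sample is due to chance. The function names are illustrative.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def chance_strengths(rng, n_samples, sample_size):
    """Strength |r| in samples where the two variables are truly unrelated."""
    strengths = []
    for _ in range(n_samples):
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        strengths.append(abs(pearson_r(xs, ys)))
    return strengths


rng = random.Random(2024)

# Cutoff: a strength exceeded by only about 5% of chance relationships.
reference = sorted(chance_strengths(rng, 2000, 20))
cutoff = reference[int(0.95 * len(reference))]

# In fresh samples with nothing going on, about 5% still clear the cutoff.
fresh = chance_strengths(rng, 2000, 20)
false_alarm_rate = sum(s > cutoff for s in fresh) / len(fresh)
print(round(cutoff, 2), round(false_alarm_rate, 3))
```

The false alarm rate printed at the end should land near 0.05, which is the price described above.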


Two Warnings about Statistical Significance 


Two important points, which we will study in detail in Chapter 24, often lead
people to misinterpret statistical significance. First, it is easier to rule out chance if the
observed relationship is based on very large numbers of observations. Even a minor
relationship will achieve “statistical significance” if the sample is very large.
However, earning that title does not necessarily imply that there is a strong relationship
or even one of practical importance.


EXAMPLE 1 Small but Significant Increase in Risk of Breast Cancer 


News Story 12 in the Appendix, “Working nights may increase breast cancer risk,” con- 
tains the following quote by Francine Laden, one of the co-authors of the study: “The 
numbers in our study are small, but they are statistically significant.” As a reader, what 
do you think that means? Reading further in the news story reveals the answer: 


The study was based on more than 78,000 nurses from 1988 through 1998. It 
found that nurses who worked rotating night shifts at least three times a month for 
one to 29 years were 8 percent more likely to develop breast cancer. For those who 
worked the shifts for more than 30 years, the relative risk went up by 36 percent. 


The “small numbers” Dr. Laden referenced were the small increases in the risk of breast
cancer, of 8 percent and 36 percent (especially the 8 percent). Because the study was
based on over 78,000 women, even the small relationship observed in the sample
probably reflects a real relationship in the population. In other words, the relationship in the
sample, while not strong, is “statistically significant.”


Second, a very strong relationship won’t necessarily achieve “statistical
significance” if the sample is very small. If you read about researchers who “failed to find
a statistically significant relationship” between two variables, do not be confused
into thinking that they have proven that there isn’t a relationship. It may be that they
simply didn’t take enough measurements to rule out chance as an explanation.


EXAMPLE 2 Do Younger Drivers Eat and Drink More while Driving?


News Story 5 in the Appendix, “Driving while distracted is common, researchers say,”
contains the following quote:


Stutts’ team had to reduce the sample size from 144 people to 70 when they ran
into budget and time constraints while minutely cataloging hundreds of hours of
video. The reduced sample size does not compromise the researchers’ findings,
Stutts said, although it does make analyzing population subsets difficult.


What does this mean? The report listed as Original Source 5 in the Appendix gives one
explicit example: the researchers’ attempt to compare behavior across age groups.
For instance, Table 7 of the report (p. 36) shows that 92.9 percent
of 18- to 29-year-old drivers were eating or drinking while driving. Middle-aged
drivers weren’t as bad, with 71.4 percent of drivers in their 30s and 40s and 78.6 percent
of drivers in their 50s eating or drinking. And a mere 42.9 percent of drivers 60 and over
were observed eating or drinking while driving. It would seem that these reflect real
differences in behavior in the population of all drivers, and not just in the drivers observed
in this study. But because there were only 14 drivers observed in each age group, the
observed relationship between age and eating behavior is not statistically significant. It
is impossible to know whether or not the relationship exists in the population. The
authors of the report wrote:


Compared to older drivers, younger drivers appeared more likely to eat or drink
while driving. . . . Sample sizes within age groups, however, were small,
prohibiting valid statistical testing. (pp. 61-62)


Notice that in this example, the authors of the original report and the journalist who
wrote the news story interpreted the problem correctly. An incorrect, and not
uncommon, interpretation would be to say that “no significant difference was found in eating
and drinking behavior across age groups.” While technically true, this language would
lead readers to believe that there is no difference in these behaviors in the population,
when in fact the sample was just too small to decide one way or the other.
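
To see how little data lies behind those percentages, they can be converted back into head counts. The short Python sketch below uses only the figures quoted in the example. It assumes, as the count of 14 drivers per age group suggests, that each reported percentage is a whole number of drivers out of 14; the dictionary keys are our own labels.

```python
drivers_per_group = 14

# Percentages quoted from Table 7 of the report.
reported_percent = {
    "18 to 29": 92.9,
    "30s and 40s": 71.4,
    "50s": 78.6,
    "60 and over": 42.9,
}

counts = {}
for group, pct in reported_percent.items():
    count = round(pct / 100 * drivers_per_group)
    # Check that the whole-number count reproduces the reported percentage.
    assert round(100 * count / drivers_per_group, 1) == pct
    counts[group] = count
    print(group, count, "of", drivers_per_group)

# One driver more or fewer shifts a group's percentage by about this much.
print(round(100 / drivers_per_group, 1))  # 7.1
```

With each driver worth about 7 percentage points, the behavior of two or three individuals is enough to produce gaps of the size reported.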


10.3 Measuring Strength Through Correlation 


A Linear Relationship 


It is convenient to have a single number to measure the strength of the relationship 
between two measurement variables and to have that number be independent of the 
units used to make the measurements. For instance, if height is reported in inches in- 
stead of centimeters, the strength of the relationship between height and weight 
should not change. 

Many types of relationships can occur between measurement variables, but in 
this chapter we consider only the most common one. The correlation between two 
measurement variables is an indicator of how closely their values fall to a straight 
line. Sometimes this measure is called the Pearson product-moment correlation or
the correlation coefficient or is simply represented by the letter r.

Notice that the statistical definition of correlation is more restricted than its com- 
mon usage. For example, if the value of one measurement variable is always the 
square of the value of the other variable, they have a perfect relationship but may still 
have no statistical correlation. As used in statistics, correlation measures linear re- 
lationships only; that is, it measures how close the individual points in a scatterplot 
are to a straight line. 
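
Both points made so far, that correlation ignores units and that it detects only straight-line relationships, can be checked numerically. The sketch below computes r with the standard product-moment formula, which the text has not spelled out. The height and weight values are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# A perfect but nonlinear relationship: y is always the square of x.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
print(pearson_r(xs, ys))  # 0.0

# Changing units leaves r unchanged (hypothetical heights and weights).
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 160, 180]
heights_cm = [h * 2.54 for h in heights_in]
r_inches = pearson_r(heights_in, weights_lb)
r_cm = pearson_r(heights_cm, weights_lb)
print(abs(r_inches - r_cm) < 1e-12)  # True
```

The first result is zero because the x values are balanced around zero, so the rising and falling halves of the curve cancel. A squared relationship over positive values only would still show a sizable positive correlation.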


Other Features of Correlations 


Here are some other features of correlations: 


1. A correlation of +1 (or 100%) indicates that there is a perfect linear rela- 
tionship between the two variables; as one increases, so does the other. In 
other words, all individuals fall on the same straight line, just as when 
two variables have a deterministic linear relationship. 


2. A correlation of −1 also indicates that there is a perfect linear
relationship between the two variables; however, as one increases, the other
decreases.


3. A correlation of zero could indicate that there is no linear relationship
between the two variables. It could also indicate that the best straight line
through the data on a scatterplot is exactly horizontal.

4. A positive correlation indicates that the variables increase together.

5. A negative correlation indicates that as one variable increases, the other
decreases.

6. Correlations are unaffected if the units of measurement are changed. For
example, the correlation between weight and height remains the same
regardless of whether height is expressed in inches, feet, or millimeters.


Examples of Positive and Negative Relationships


Following are some examples of both positive and negative relationships. Notice
how the closeness of the points to a straight line determines the magnitude of the
correlation, whereas whether the line slopes up or down determines if the correlation is
positive or negative.


EXAMPLE 3 Verbal SAT and GPA


In Chapter 9, we saw a scatterplot showing the relationship between the two variables
verbal SAT and GPA for a sample of college students. The correlation for the data in the
scatterplot is .485, indicating a moderate positive relationship. In other words, students
with higher verbal SAT scores tend to have higher GPAs as well, but the relationship is
nowhere close to being exact.


EXAMPLE 4 Husbands’ and Wives’ Ages and Heights


Marsh (1988, p. 315) and Hand et al. (1994, pp. 179-183) reported data on the ages
and heights of a random sample of 200 married couples in Britain, collected in 1980 by
the Office of Population Census and Surveys. Figures 10.1 and 10.2
show scatterplots for the ages and the heights, respectively, of the couples. Notice that
the ages fall much closer to a straight line than do the heights. In other words,
husbands’ and wives’ ages are likely to be closely related, whereas their heights are less
likely to be so. The correlation between husbands’ and wives’ ages is .94, whereas the
correlation between their heights is only .36. Thus, the values for the correlations
confirm what we see from looking at the scatterplots.


EXAMPLE 5 Occupational Prestige and Suicide Rates


Labovitz (1970, Table 1) and Hand et al. (1994, pp. 395-396) listed suicide rates and
prestige ratings for 36 occupations in the United States. The suicide rates were for men aged
20 to 64; the prestige ratings were determined by the National Opinion Research Center.
Figure 10.3 displays a scatterplot of the data. Notice that there does not
appear to be much of a relationship between suicide rates and occupational prestige, and
the correlation of .109 confirms that fact. You should also notice the outlier on the
plot with a very high suicide rate and a somewhat high prestige rating. That point
corresponds to the occupation of “managers, officials, and proprietors—self-employed—manufacturing.”
The outlier also appears to be responsible for the weak positive
correlation. In fact, if that point is removed, the correlation drops to .018, very near zero.
Therefore, we can conclude that there is little relationship between occupational prestige and
suicide rates.


EXAMPLE 6 Professional Golfers’ Putting Success 


Iman (1994, p. 507) reported on a study conducted by Sports Illustrated magazine in
which the magazine studied success rates at putting for professional golfers. Using data
from 15 tournaments, the researchers determined the percentage of successful putts at
distances from 2 feet to 20 feet. We have restricted our attention to the part of the data
that follows a linear relationship, which includes putting distances from 5 feet to 15 feet.
Figure 10.4 illustrates this relationship. The correlation between distance and rate of
success is −.94. Notice the negative sign, which indicates that as distance goes up, success
rate goes down.


Figure 10.3
[Scatterplot of suicide rate versus occupational prestige for 36 occupations]
10.4 Specifying Linear Relationships with Regression


Sometimes, in addition to knowing the strength of the connection between two vari- 
ables, we would like a formula for the relationship. For example, it might be useful 
for colleges to have a formula for the connection between verbal SAT score and col- 
lege GPA. They could use it to predict the potential GPAs of future students. Some 
colleges do that kind of prediction to decide whom to admit, but they use a collec- 
tion of variables instead of just one. The simplest kind of relationship between two 
variables is a straight line, and that’s the only type we discuss here. Our goal is to 
find a straight line that comes as close as possible to the points in a scatterplot. 


Defining Regression 


We call the procedure we use to find a straight line that comes as close as possible 
to the points in a scatterplot regression; the resulting line, the regression line; and 
the formula that describes the line, the regression equation. You may wonder why 
that word is used. Until now, most of the vocabulary borrowed by statisticians had 
at least some connection to the common usage of the words. The use of the word re- 
gression dates back to Francis Galton, who studied heredity in the late 1800s. (See 
Stigler, 1986 or 1989, for a detailed historical account.) One of Galton’s interests 
was whether a man’s height as an adult could be predicted by his parents’ heights. 
He discovered that it could, but that the relationship was such that very tall parents 
tended to have children who were shorter than they were, and very short parents 
tended to have children taller than themselves. He initially described this phenome- 
non by saying there was “reversion to mediocrity” but later changed to the termi- 
nology “regression to mediocrity.” Henceforth, the technique of determining such 
relationships has been called regression. 

How are we to find the best straight line relating two variables? We could just 
take a ruler and try to fit a line through the scatterplot, but each of us would proba- 
bly get a different answer. Instead, the most common procedure is to find what is 
called the least squares line. In determining the least squares line, priority is given 
to how close the points fall to the line for the variable represented by the vertical 
axis. Those distances are squared and added up for all of the points in the sample. 
For the least squares line, that sum is smaller than it would be for any other line. 

The vertical distances are chosen because the equation is often used to predict 
that variable when the one on the horizontal axis is known. Therefore, we want to 
minimize how far off the prediction would be in that direction. In other words, the 
horizontal axis usually represents an explanatory variable, and the vertical axis rep- 
resents a response variable. We want to predict the value of the response variable 
from knowing the value of the explanatory variable. The line we use is the one that 
minimizes the sum of the squared errors resulting from this prediction for the indi- 
viduals in the sample. The reasoning is that if the line is good at predicting the re- 
sponse for those in the sample, when the response is already known, then it will work 
well for predicting the response in the future when only the explanatory variable is 
known. 
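
To make the least squares idea concrete, here is a minimal sketch in Python. The five (x, y) pairs are hypothetical values invented for illustration, and the function names are our own. The code finds the least squares line and then checks that a few nudged lines give a larger sum of squared vertical distances.

```python
# Hypothetical (x, y) pairs, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def sum_squared_errors(intercept, slope, xs, ys):
    """Add up the squared vertical distances between each point and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ssx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / ssx
    return y_bar - slope * x_bar, slope

a, b = least_squares_line(xs, ys)
best = sum_squared_errors(a, b, xs, ys)

# Nudge the intercept and slope; each nudged line should do worse.
for da, db in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.2), (0.0, -0.2), (0.3, -0.1)]:
    other = sum_squared_errors(a + da, b + db, xs, ys)
    print(f"intercept {a + da:.2f}, slope {b + db:.2f}: "
          f"sum of squares {other:.3f} (least squares line: {best:.3f})")
```

Trying a handful of alternative lines does not prove that no better line exists, but it shows what "smaller than it would be for any other line" means in practice.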


Figure 10.5 

A straight line with intercept of 32 and slope of 1.8


EXAMPLE 7 
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The Equation for the Line 


All straight lines can be expressed by the same formula. Using standard conventions, 
we call the variable on the vertical axis y and the variable on the horizontal axis x. 
We can then write the equation for the line relating them as 


y = a + bx


where for any given situation, a and b would be replaced by numbers. We call the 
number represented by a the intercept and the number represented by b the slope. 
The intercept simply tells us at what particular point the line crosses the vertical axis 
when the horizontal axis is at zero. The slope tells us how much of an increase there 
is for one variable (the one on the vertical axis) when the other (on the horizontal 
axis) increases by one unit. A negative slope indicates a decrease in one variable as 
the other increases, just as a negative correlation does. 

For example, Figure 10.5 shows the (deterministic) relationship between y = 
temperature in Fahrenheit and x = temperature in Celsius. The equation for the re- 
lationship is 


y = 32 + 1.8x


The intercept, 32, is the temperature in Fahrenheit when the Celsius temperature is 
zero. The slope, 1.8, is the amount by which Fahrenheit temperature increases when 
Celsius temperature increases by one unit. 
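
As a small illustration of how the intercept and slope are read, the sketch below codes the temperature equation directly. The names `INTERCEPT`, `SLOPE`, and `fahrenheit` are our own choices for this example.

```python
INTERCEPT = 32.0  # Fahrenheit temperature when the Celsius temperature is 0
SLOPE = 1.8       # Fahrenheit degrees gained per 1-degree rise in Celsius

def fahrenheit(celsius):
    """Apply y = a + bx with a = 32 and b = 1.8."""
    return INTERCEPT + SLOPE * celsius

print(fahrenheit(0))                    # the value of y where x is zero
print(fahrenheit(100))                  # Fahrenheit equivalent of 100 Celsius
print(fahrenheit(21) - fahrenheit(20))  # change in y for a one-unit step in x
```

The last line may show a tiny floating-point discrepancy from 1.8; the point is that a one-unit step in x always changes y by the slope.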


Husbands’ and Wives’ Ages, Revisited 


Figure 10.6 Scatterplot and regression line for British husbands’ and wives’ ages
Source: Hand et al., 1994.


Figure 10.6 shows the same scatterplot as Figure 10.1, relating ages of husbands and
wives, except that now we have added the regression line. This line minimizes the sum
of the squared vertical distances between the line and the husbands’ actual ages. The
regression equation for the line shown in Figure 10.6, relating husbands’ and wives’
ages, is


y = 3.6 + .97x 
or, equivalently, 
husband's age = 3.6 + (.97)(wife’s age) 


Notice that the intercept of 3.6 does not have any meaning in this example. It would 
be the predicted age of the husband of a woman whose age is 0. But obviously that’s 
not a possible wife’s age. The slope does have a reasonable interpretation. For every year 
of difference in two wives’ ages, there is a difference of about .97 years in their
husbands’ ages, close to 1 year. For instance, if two women are 10 years apart in age, their
husbands can be expected to be about (.97) × 10 = 9.7 years apart in age.

Let's use the equation to predict husband's age at various wife's ages. 


Wife’s Age     Predicted Age of Husband

20 years       3.6 + (.97)(20) = 23.0 years
25 years       3.6 + (.97)(25) = 27.9 years
40 years       3.6 + (.97)(40) = 42.4 years
55 years       3.6 + (.97)(55) = 57.0 years


This table shows that for the range of ages in the sample, husbands tend to be 2 to 3
years older than their wives, on average. The older the couple, the smaller the gap in
their ages. Remember that with statistical relationships, we are determining what
happens to the average and not to any given individual. Thus, although most couples won't
fit the pattern given by the regression line exactly, it does show us one way to represent
the average relationship for the whole group.
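

The predictions in this example can be reproduced with a few lines of Python. This is only a sketch: the function name is our own, and the intercept and slope are the rounded values 3.6 and .97 reported above, so results can differ slightly from figures based on unrounded coefficients.

```python
INTERCEPT = 3.6  # rounded intercept from the regression equation
SLOPE = 0.97     # rounded slope from the regression equation

def predicted_husband_age(wife_age):
    """Predicted average husband's age for a given wife's age."""
    return INTERCEPT + SLOPE * wife_age

for wife_age in (20, 25, 40, 55):
    husband = predicted_husband_age(wife_age)
    gap = husband - wife_age
    print(f"wife {wife_age}: predicted husband age {husband:.2f}, "
          f"gap {gap:.2f} years")
```

Because the slope is slightly less than 1, the printed gap shrinks as the wife's age increases, which matches the pattern described for the table.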


Extrapolation 


It is generally not a good idea to use a regression equation to predict values far out- 
side the range where the original data fell. There is no guarantee that the relationship 
will continue beyond the range for which we have data. For example, using the re- 
gression equation illustrated in Figure 10.6, we would predict that women who are 
100 years old have husbands whose average age is 100.6 years. But women tend to 
live to be older than men, so it is more likely that if a woman is married at 100, her 
husband is younger than she is. The relationship for much older couples would be 
affected by differing death rates for men and women, and a different equation would 
most likely apply. It is typically acceptable to use the equation only for a minor ex- 
trapolation beyond the range of the original data. 
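
One way to guard against careless extrapolation is to have the prediction report whether the input falls inside the range of the original data. The sketch below does that for the husbands' and wives' ages equation. The range of 18 to 65 is a placeholder assumption, not the range of the actual sample, and the function name is our own.

```python
def predict_husband_age(wife_age, observed_min, observed_max):
    """Return (prediction, True if wife_age lies within the observed range)."""
    prediction = 3.6 + 0.97 * wife_age
    within = observed_min <= wife_age <= observed_max
    return prediction, within

# ASSUMPTION: placeholder limits for illustration, not taken from the data set.
OBSERVED_MIN, OBSERVED_MAX = 18, 65

for age in (40, 100):
    prediction, within = predict_husband_age(age, OBSERVED_MIN, OBSERVED_MAX)
    note = "within observed range" if within else "EXTRAPOLATION: treat with caution"
    print(f"wife {age}: predicted husband age {prediction:.1f} ({note})")
```

The equation happily returns a number for a 100-year-old wife; the flag is a reminder that the number rests on a relationship we never observed at that age.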


A Final Cautionary Note 


It is easy to be misled by inappropriate interpretations and uses of correlation and re- 
gression. In the next chapter, we examine how that can happen, and how you can 
avoid it. 


Are Attitudes about Love and Romance Hereditary? 
Source: Waller and Shaver (September 1994). 


Are you the jealous type? Do you think of love and relationships as a practical mat- 
ter? Which of the following two statements better describes how you are likely to fall 
in love? 


My lover and I were attracted to each other immediately after we first met. 


It is hard for me to say exactly when our friendship turned into love. 


If the first statement is more likely to describe you, you would probably score high 
on what psychologists call the Eros dimension of love, characteristic of those who 
“place considerable value on love and passion, are self-confident, enjoy intimacy 
and self-disclosure, and fall in love fairly quickly” (Waller and Shaver, 1994, p.268). 
However, if you identify more with the second statement, you would probably score 
higher on the Storge dimension, characteristic of those who “value close friendship, 
companionship, and reliable affection” (p. 268). Whatever your beliefs about love 
and romance, do you think they are partially inherited, or are they completely due to 
social and environmental influences? 

Psychologists Niels Waller and Philip Shaver set out to answer the question of 
whether feelings about love and romance are partially genetic, as are most other per- 
sonality traits. Waller and Shaver studied the love styles of 890 adult twins and 172 
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single twins and their spouses from the California Twin Registry. They compared the 
similarities between the answers given by monozygotic twins (MZ), who share 
100% of their genes, to the similarities between those of dizygotic twins (DZ), who 
share, on average, 50% of their genes. They also studied the similarities between the 
answers of twins and those of their spouses. If love styles are genetic, rather than de- 
termined by environmental and other factors, then the matches between MZ twins 
should be substantially higher than those between DZ twins. 

Waller and Shaver studied 345 pairs of MZ twins, 100 pairs of DZ twins, and 172 
spouse pairs (that is, a twin and his or her spouse). Each person filled out a ques- 
tionnaire called the “Love Attitudes Scale” (LAS), which asked them to read 42 
statements like the two given earlier. For each statement, respondents assigned a 
ranking from 1 to 5, where 1 meant “strongly agree” and 5 meant “strongly dis- 
agree.” There were seven questions related to each of six love styles, with a score de- 
termined for each person on each love style. Therefore, there were six scores for 
each person. 

In addition to the two styles already described (Eros and Storge), scores were 
generated for the following four: 


■ Ludus characterizes those who “value the fun and excitement of romantic
relationships, especially with multiple alternative partners; they generally are not
interested in mutual self-disclosure, intimacy or ‘getting serious’” (p. 268).


■ Pragma types are “pragmatic, entering a relationship only if it meets certain
practical criteria” (p. 269).


■ Mania types “are desperate and conflicted about love. They yearn intensely for
love but then experience it as a source of pain, a cause of jealousy, a reason for
insomnia” (p. 269).


■ Those who score high on Agape “are oriented more toward what they can give to,
rather than receive from, a romantic partner. Agape is a selfless, almost spiritual
form of love” (p. 269).


For each type of love style, and for each of the three types of pairs (MZ twins, DZ 
twins, and spouses), the researchers computed a correlation. The results are shown 
in Table 10.1. (They first removed effects due to age and gender, so the correlations 
are not due to a relationship between love styles and age or gender.) Notice that the 
correlations for the MZ twins are lower than they are for the DZ twins for two love 
styles, and just slightly higher for the other four styles. This is in contrast to most 
other personality traits. For comparison purposes, three such traits are also shown in 
Table 10.1. Notice that for those traits, the correlations are much higher for the MZ 
twins, indicating a substantial hereditary component. Regarding the findings for love 
styles, Waller and Shaver conclude: 


This surprising, and very unusual, finding suggests that genes are not
important determinants of attitudes toward romantic love. Rather, the common
environment appears to play the cardinal role in shaping familial resemblance on
these dimensions. (p. 271)


Table 10.1 Correlations for Love Styles and for Some Personality Traits

                                     Correlation
                       Monozygotic Twins   Dizygotic Twins   Spouses

Love Style
  Eros                        .16                .14           .36
  Ludus                       .18                .30           .08
  Storge                      .18                .12           .22
  Pragma                      .40                .32           .29
  Mania                       .35                .27          −.01
  Agape                       .30                .37           .28

Personality Trait
  Well-being                  .38                .13           .04
  Achievement                 .43                .16           .08
  Social closeness            .38                .01          −.04

Source: Waller and Shaver, September 1994.


CASE STUDY 10.2 A Weighty Issue: Women Want Less, Men Want More


Do you like your weight? Let me guess ... If you’re male and under about 175 
pounds, you probably want to weigh the same or more than you do. If you’re female, 
no matter what you weigh, you probably want to weigh the same or less. Those were 
the results uncovered in a large statistics class (119 females and 63 males) when stu- 
dents were asked to give their actual and their ideal weights. 

Figure 10.7 shows a scatterplot of ideal versus actual weight for
the females, and Figure 10.8 is the same plot for the males. Each point represents 
one student, whose ideal weight can be read on the vertical axis and actual weight 
can be read on the horizontal axis. What is the relationship between ideal and actual 
weight, on average, for men and for women? 

First, notice that if everyone were at their ideal weight, all points would fall on 
a line with the equation 


ideal = actual 


That line is drawn in each figure. Most of the women fall below that line, indicating 
that their ideal weight is below their actual weight. The situation is not as clear for 
the men, but a pattern is still evident. Most of those weighing under 175 pounds fall 
on or above the line (would prefer to weigh the same or more than they do), and most 
of those weighing over 175 pounds fall on or below the line (would prefer to weigh 
the same or less than they do). 

The regression lines are also shown on each scatterplot. The regression equa- 
tions are: 


Women: ideal = 43.9 + 0.6 actual 
Men: ideal = 52.5 + 0.7 actual 
These equations have several interesting features, which, remember, summarize the
relationship between ideal and actual weight for the aggregate, not for each individual:


■ The weight for which ideal = actual is about 110 pounds for women and 175
pounds for men. Below those weights, actual weight is less than desired; above
them, actual weight is more than desired.

■ The slopes represent the increase in ideal weight for each 1-pound increase in
actual weight. Thus, every 10 pounds of additional weight indicates an increase of
only 6 pounds in ideal weight for women and 7 pounds for men. Another way to
think about the slope is that if two women’s actual weights differed by 10 pounds,
their ideal weights would differ by about (0.6) × 10 = 6 pounds.
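

The two features just listed follow from simple arithmetic on the regression equations, sketched below in Python. Setting ideal = actual = w in ideal = a + b(actual) gives w = a / (1 − b). The function and variable names are our own.

```python
def crossover_weight(intercept, slope):
    """Weight w at which the predicted ideal equals actual: w = intercept + slope * w."""
    return intercept / (1 - slope)

# (intercept, slope) pairs from the regression equations in this case study
equations = {"women": (43.9, 0.6), "men": (52.5, 0.7)}

for group, (intercept, slope) in equations.items():
    w = crossover_weight(intercept, slope)
    print(f"{group}: ideal = actual at about {w:.0f} pounds; "
          f"10 more pounds of actual weight goes with about "
          f"{10 * slope:.0f} more pounds of ideal weight")
```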


For Those Who Like Formulas 


The Data 


n pairs of observations, (x_i, y_i), i = 1, 2, ..., n, where x_i is plotted on the horizontal
axis and y_i on the vertical axis.


Summaries of the Data, Useful for Correlation and Regression 
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The Regression Slope and Intercept 


slope = $b = \dfrac{SXY}{SSX}$

intercept = $a = \bar{y} - b\bar{x}$
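

For reference, the symbols in these formulas are conventionally defined as shown below. This is a sketch that assumes the standard sums-of-squares notation; SSY and the correlation r are included for completeness and are our addition.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, &
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n} y_i \\
SSX &= \sum_{i=1}^{n} (x_i - \bar{x})^2, &
SSY &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
SXY &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), &
r &= \frac{SXY}{\sqrt{SSX \cdot SSY}} \\
b &= \frac{SXY}{SSX}, &
a &= \bar{y} - b\,\bar{x}
\end{align*}
```

Under these definitions the slope and the correlation always share the same sign, because both have SXY in the numerator and a positive denominator.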


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


*1. Suppose 100 different researchers each did a study to see if there was a relationship
between coffee consumption and height. Suppose there really is no such relationship in
the population. Would you expect any of the researchers to find a statistically
significant relationship? If so, approximately how many? Explain your answer.


2. In Figure 10.2, we observed that the correlation between husbands’ and wives’ heights,
measured in millimeters, was .36. Can you determine what the correlation would be if the
heights were converted to inches (and not rounded off)? Explain.


*3. A pint of water weighs 1.04 pounds, so 1 pound of water is 0.96 pint. Suppose
a merchant sells water in containers weighing 0.5 pound, but customers can fill
them to their liking. It is easier to weigh the filled container than to measure the
volume of water the customer is purchasing. Define x to be the weight of the
container and the water and y to be the volume of the water.


*a. Write the equation the merchant would use to determine the volume y when
x is known.


b. Specify the numerical values of the intercept and the slope, and interpret
their physical meanings for this example.


c. What is the correlation between x and y for this example?


d. Draw a picture of the relationship between x and y.


4. Is each of the following pairs of variables likely to have a positive correlation
or a negative correlation?


a. Daily temperatures at noon in New York City and in Boston measured for a
year.


b. Weights of automobiles and their gas mileage in average miles per gallon.


c. Hours of television watched and grade-point average for college students.


d. Years of education and salary.


5. Suppose a weak relationship exists between two variables in a population.
Which would be more likely to result in a statistically significant relationship
between the two variables: a sample of size 100 or a sample of size 10,000?
Explain.


6. The relationship between height and weight is a well-established and obvious
fact. Suppose you were to sample heights and weights for a small number of
your friends, and you failed to find a statistically significant relationship
between the two variables. Would you conclude that the relationship doesn’t hold
for the population of people like your friends? Explain.


*7. Which implies a stronger linear relationship, a correlation of +.4 or a
correlation of −.6? Explain.


8. Give an example of a pair of variables that are likely to have a positive
correlation and a pair of variables that are likely to have a negative correlation.


9. Explain how two variables can have a perfect curved relationship and yet have
zero correlation. Draw a picture of a set of data meeting those criteria.


10. The regression line relating verbal SAT scores and GPA for the data exhibited
in Figure 9.5 is


GPA = 0.539 + (0.00362)(verbal SAT)


a. Predict the average GPA for those with verbal SAT scores of 500.


b. Explain what the slope of 0.00362 represents.


c. The lowest possible SAT score is 200. Does the intercept of 0.539 have any
useful meaning for this example? Explain.


*11. Refer to Case Study 10.2, in which regression equations are given for males and 
females relating ideal weight to actual weight. The equations are 


Women: ideal = 43.9 + 0.6 actual 
Men: ideal = 52.5 + 0.7 actual 


*a. Predict the ideal weight for a man who weighs 150 pounds and for a woman 
who weighs 150 pounds. Compare the results. 


b. Does the intercept of 43.9 have a logical physical interpretation in the con- 
text of this example? Explain. 


c. Does the slope of 0.6 have a logical interpretation in the context of this ex- 
ample? Explain. 

d. Outliers in scatterplots may be within the range of values for each variable 
individually but lie outside the general pattern when the variables are exam- 
ined in combination. A few points in Figures 10.7 and 10.8 could be consid- 
ered as outliers. In the context of this example, explain the characteristics of 
someone who appears as an outlier. 


*12. In Chapter 9, we examined a picture of winning time in men’s 500-meter speed 
skating plotted across time. The data represented in the plot started in 1924 and 
went through 2002. A regression equation relating winning time and year for 
1924 to 1998 is 


winning time = 264.5 − (0.1142)(year)


*a. Would the correlation between winning time and year be positive or
negative? Explain.

*b. In 2002, the actual winning time for the gold medal was 34.42 seconds. Use
the regression equation to predict the winning time for 2002, and compare
the prediction to what actually happened.


*c. Explain what the slope of −0.1142 indicates in terms of how winning times
change from year to year.


d. The Olympics are held every four years. Explain what the slope of −0.1142
indicates in terms of how winning times should change from one Olympics
to the next.


13. Explain why we should not use the regression equation we found in Exercise 12 
for speed-skating time versus year to predict the winning time for the 2010 Win- 
ter Olympics. 


14. The regression equation relating distance (in feet) and success rate (percent) for
professional golfers, based on 11 distances ranging from 5 feet to 15 feet, is


success rate = 76.5 − (3.95)(distance)


a. What percent success would you expect for these professional golfers if the
putting distance is 6.5 feet?


b. Explain what the slope of −3.95 means in terms of how success changes
with distance.


15. The original data for the putting success of professional golfers included values
beyond those we used for this example (5 feet to 15 feet), in both directions. At
a distance of 2 feet, 93.3% of the putts were successful. At a distance of 20 feet,
15.8% of the putts were successful.


a. Use the equation in Exercise 14 to predict success rates for those two
distances (2 feet and 20 feet). Compare the predictions to the actual success
rates.


b. Use your results from part a to explain why it is not a good idea to use a
regression equation to predict information beyond the range of values from
which the equation was determined.


c. Based on the picture in Figure 10.4 and the additional information in this
exercise, draw a picture of what you think the relationship between putting
distance and success rate would look like for the entire range from 2 feet to
20 feet.


d. Explain why a regression equation should not be formulated for the entire
range from 2 feet to 20 feet.


*16. In one of the examples in this chapter, we noticed a very strong relationship
between husbands’ and wives’ ages for a sample of 200 British couples, with a
correlation of .94. Coincidentally, the relationship between putting distance and
success rate for professional golfers had a correlation of −.94, based on 11 data
points. This latter correlation was statistically significant, so we can be pretty
sure the observed relationship was not just due to chance. Based on this
information, do you think the observed relationship between husbands’ and wives’
ages is statistically significant? Explain.


17. Refer to the journal article given as Original Source 2 on the CD, “Development
and initial validation of the Hangover Symptoms Scale: Prevalence and
correlates of hangover symptoms in college students.” On page 1447 it says: “The
HSS [Hangover Symptoms Scale] was significantly positively associated with
the frequency of drinking (r = 0.44).”


a. What two variables were measured for each person to provide this result?


b. Explain what is meant by r = .44.


c. What is meant by the word significantly as it is used in the quote?


d. The authors did not provide a regression equation relating the two
variables. If a regression equation were to be found for the two variables in the
quote, which one do you think would be the logical explanatory variable?
Explain.


CHAPTER 10 Relationships Between Measurement Variables 199 


Mini-Projects 


1. (Computer or statistics calculator required.) Measure the heights and weights 
of 10 friends of the same sex. Draw a scatterplot of the data, with weight on the 
vertical axis and height on the horizontal axis. Using a computer or calculator 
that produces regression equations, find the regression equation for your data. 
Draw it on your scatter diagram. Use this to predict the average weight for peo- 
ple of that sex who are 67 inches tall. 


2. Go to your library or an electronic journal resource and peruse journal articles, 
looking for examples of scatterplots accompanied by correlations. Find three ex- 
amples in different journal articles. Present the scatterplots and correlations, and 
explain in words what you would conclude about the relationship between the 
two variables in each case. 
Relationships Can Be Deceiving


Thought Questions 


1. Use the following two pictures to speculate on what influence outliers have on
correlation. For each picture, do you think the correlation is higher or lower than it
would be without the outlier? (Hint: Remember that correlation measures how
closely points fall to a straight line.)


2. A strong correlation has been found in a certain city in the northeastern United States
between weekly sales of hot chocolate and weekly sales of facial tissues. Would you
interpret that to mean that hot chocolate causes people to need facial tissues?
Explain.


3. Researchers have shown that there is a positive correlation between the average fat
intake and the breast cancer rate across countries. In other words, countries with
higher fat intake tend to have higher breast cancer rates. Does this correlation prove
that dietary fat is a contributing cause of breast cancer? Explain.


4. If you were to draw a scatterplot of number of women in the workforce versus
number of Christmas trees sold in the United States for each year between 1930 and the
present, you would find a very strong correlation. Why do you think this would be
true? Does one cause the other?
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11.1 Illegitimate Correlations 


EXAMPLE 1 


In Chapter 10, we learned that the correlation between two measurement variables 
provides information about how closely related they are. A strong correlation implies 
that the two variables are closely associated or related. With a positive correlation, 
they increase together, and with a negative correlation, one variable tends to increase 
as the other decreases. 

However, as with any numerical summary, correlation does not provide a com- 
plete picture. A number of anomalies can cause misleading correlations. Ideally, all 
reported correlations would be accompanied by a scatterplot. Without a scatterplot, 
however, you need to ascertain whether any of the problems discussed in this section 
may be distorting the correlation between two variables. 


Watch out for these problems with correlations: 


■ Outliers can substantially inflate or deflate correlations.


■ Groups combined inappropriately may mask relationships.


The Impact Outliers Have on Correlations 


In a manner similar to the effect we saw on means, outliers can have a large impact 
on correlations. This is especially true for small samples. An outlier that is consis- 
tent with the trend of the rest of the data will inflate the correlation. An outlier that 
is not consistent with the rest of the data can substantially decrease the correlation. 


Highway Deaths and Speed Limits 


The data in Table 11.1 come from the time when the United States still had a maximum 
speed limit of 55 miles per hour. The correlation between death rate and speed limit 
across countries is .55, indicating a moderate relationship. Higher death rates tend to be 
associated with higher speed limits. A scatterplot of the data is presented in Figure 11.1; 
the two countries with the highest speed limits are labeled. Notice that Italy has both a 
much higher speed limit and a much higher death rate than any other country. That fact 
alone is responsible for the magnitude of the correlation. In fact, if Italy is removed, the 
correlation drops to .098, a negligible association. Of course, we could now claim that 
Britain is responsible for the almost zero magnitude of the correlation, and we would be 
right. If we remove Britain from the plot, the correlation is no longer negligible; it jumps 
to .70. You can see how much influence outliers have, sometimes inflating correlations 
and sometimes deflating them. (Of course, the actual relationship between speed limit 
and death rate is complicated by many other factors, a point we discuss later in this
chapter.)
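

The three correlations quoted in this example can be recomputed from the values in Table 11.1. The sketch below does so in plain Python. The helper names `correlation` and `correlation_without` are our own, and the correlation is computed from the usual sums of squares and cross-products.

```python
from math import sqrt

# (country, death rate per 100 million vehicle miles, speed limit in mph), Table 11.1
data = [
    ("Norway", 3.0, 55), ("United States", 3.3, 55), ("Finland", 3.4, 55),
    ("Britain", 3.5, 70), ("Denmark", 4.1, 55), ("Canada", 4.3, 60),
    ("Japan", 4.7, 55), ("Australia", 4.9, 60), ("Netherlands", 5.1, 60),
    ("Italy", 6.1, 75),
]

def correlation(xs, ys):
    """Correlation between two equal-length lists of numbers."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ssx = sum((x - x_bar) ** 2 for x in xs)
    ssy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(ssx * ssy)

def correlation_without(excluded):
    """Correlation of speed limit and death rate, leaving out the named countries."""
    rows = [(deaths, speed) for name, deaths, speed in data if name not in excluded]
    deaths, speeds = zip(*rows)
    return correlation(speeds, deaths)

print(f"all ten countries:         {correlation_without(set()):.3f}")
print(f"without Italy:             {correlation_without({'Italy'}):.3f}")
print(f"without Italy and Britain: {correlation_without({'Italy', 'Britain'}):.3f}")
```

With only ten data points, removing one or two countries is enough to move the correlation from moderate to negligible and back to fairly strong.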


Figure 11.1 An example of how an outlier can inflate correlation
Source: Rivkin, 1986.


Table 11.1 Highway Death Rates and Speed Limits

                  Death Rate                         Speed Limit
Country           (Per 100 Million Vehicle Miles)    (in Miles Per Hour)
Norway            3.0                                55
United States     3.3                                55
Finland           3.4                                55
Britain           3.5                                70
Denmark           4.1                                55
Canada            4.3                                60
Japan             4.7                                55
Australia         4.9                                60
Netherlands       5.1                                60
Italy             6.1                                75

Source: Rivkin, 1986.


One of the ways in which outliers can occur in a set of data is through erroneous
recording of the data. Common wisdom among statisticians is that at least 5% of all
data points are corrupted, either when they are initially recorded or when they are
entered into the computer. Good researchers check their data using scatterplots,
stemplots, and other methods to ensure that such errors are detected and corrected.
However, they do sometimes escape notice, and they can play havoc with numerical
measures like correlation.


Figure 11.2 An example of how an outlier can deflate correlation
Source: Adapted from Figure 10.1.
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EXAMPLE 2 Ages of Husbands and Wives 


Figure 11.2 shows a subset of the data we examined in Chapter 10, Figure 10.1, relat- 
ing the ages of husbands and wives in Britain. In addition, an outlier has been added. 
This outlier could easily have occurred in the data set if someone had erroneously en- 
tered one husband's age as 82 when it should have been 28. 

The correlation for the picture as shown is .39, indicating a somewhat low correlation
between husbands’ and wives’ ages. However, the low correlation is completely
attributable to the outlier. When it is removed, the correlation for the remaining points is
.964, indicating a very strong relationship.


Legitimate Outliers, Illegitimate Correlation 


Outliers can also occur as legitimate data, as we saw in the example for which both 
Italy and Britain had much higher speed limits than other countries. However, the 
theory of correlation was developed with the idea that both measurements were from 
bell-shaped distributions, so outliers would be unlikely to occur. As we have seen, 
correlations are quite sensitive to outliers. Be very careful when you are presented 
with correlations for data in which outliers are likely to occur or when correlations 
are presented for a small sample, as shown in Example 3. Not all researchers or re- 
porters are aware of the havoc outliers can play with correlation, and they may in- 
nocently lead you astray by not giving you the full details. 
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EXAMPLE 3 


EXAMPLE 4 


Table 11.2 Major Earthquakes in the Continental United States, 1850-2002


Date                 Location                    Deaths    Magnitude
August 31, 1886      Charleston, SC                  60    6.6
April 18-19, 1906    San Francisco, CA              503    8.3
March 10, 1933       Long Beach, CA                 115    6.2
February 9, 1971     San Fernando Valley, CA         65    6.6
October 17, 1989     San Francisco area (CA)         62    7.1
June 28, 1992        Yucca Valley, CA                 1    7.5
January 17, 1994     Northridge, CA                  61    6.8


Source: World Almanac and Book of Facts online, Nov. 2003.


Earthquakes in the Continental United States 

Table 11.2 lists the major earthquakes that occurred in the continental United States be- 
tween 1850 and 2002. The correlation between deaths and magnitude for these seven 
earthquakes is .689, showing a relatively strong association. This relationship implies 
that, on average, higher death tolls accompany stronger earthquakes. 

However, if you examine the scatterplot of the data shown in Figure 11.3, you will 
notice that the correlation is entirely due to the famous San Francisco earthquake of 
1906. In fact, for the remaining earthquakes, the trend is reversed. Without the
1906 quake, the correlation for these six earthquakes is strongly negative, at
−.92. Higher-magnitude quakes are associated with fewer deaths.

Clearly, trying to interpret the correlation between magnitude and death toll for 
this small group of earthquakes is a misuse of statistics. The largest earthquake, in 
1906, occurred before earthquake building codes were enforced. The next largest 
quake, with magnitude 7.5, killed only one person but occurred in a very sparsely
populated area.
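
The effect of the single 1906 observation can be checked directly. The following Python sketch recomputes the correlation between deaths and magnitude from the values in Table 11.2, once with all seven earthquakes and once with the 1906 quake left out. The helper names `pearson_r` and `deaths_vs_magnitude` are illustrative and not part of the text; the calculation is the ordinary Pearson correlation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# (year, deaths, magnitude) transcribed from Table 11.2
quakes = [
    (1886, 60, 6.6),
    (1906, 503, 8.3),
    (1933, 115, 6.2),
    (1971, 65, 6.6),
    (1989, 62, 7.1),
    (1992, 1, 7.5),
    (1994, 61, 6.8),
]


def deaths_vs_magnitude(rows):
    """Correlation between deaths and magnitude for the given rows."""
    return pearson_r([deaths for _, deaths, _ in rows],
                     [mag for _, _, mag in rows])


r_all = deaths_vs_magnitude(quakes)
r_without_1906 = deaths_vs_magnitude([q for q in quakes if q[0] != 1906])

print(f"All seven earthquakes: r = {r_all:.3f}")
print(f"Without 1906:          r = {r_without_1906:.2f}")
```

Run as written, the two printed values should match the .689 and -.92 quoted in the example, which shows how completely one point can determine the sign of a correlation in a sample this small.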


The Missing Link: A Third Variable 


Another common mistake that can lead to an illegitimate correlation is combining 
two or more groups when they should be considered separately. The variables for 
each group may actually fall very close to a straight line, but when the groups are 
examined together, the individual relationships may be masked. As a result, it will 
appear that there is very little correlation between the two variables. 

This problem is a variation of “Simpson’s Paradox” for count data, a phenome- 
non we will study in the next chapter. However, statisticians do not seem to be as 
alert to this problem when it occurs with measurement data. When you read that 
two variables have a very low correlation, ask yourself whether data may have been 
combined into one correlation when groups should, instead, have been considered 
separately. 


Figure 11.3 A data set for which correlation should not be used.
Source: Data from Table 11.2.


The Fewer the Pages, the More Valuable the Book?


If you peruse the bookshelves of a typical college professor, you will find a variety of
books ranging from textbooks to esoteric technical publications to paperback novels. To
see whether the price of a book can be predicted from the number of pages it
contains, a college professor recorded the number of pages and price for 15 books on
one shelf. The numbers are shown in Table 11.3. Is there a relationship between number
of pages and the price of the book? The correlation for these figures is -.312. The
negative correlation indicates that the more pages a book has, the less it costs, which is
certainly a counterintuitive result.

Figure 11.4 illustrates what has gone wrong. It displays the data in a scatterplot, 
but it also identifies the books by type. The letter H indicates a hardcover book; the let- 
ter S indicates a softcover book. The collection of books on the professor's shelf con- 
sisted of softcover novels, which tend to be long but inexpensive, and hardcover 
technical books, which tend to be shorter but very expensive. If the correlations are cal- 
culated within each type, we find the result we would expect. The correlation between 
number of pages and price is .64 for the softcover books alone, and .35 for the
hardcover books alone. Combining the two types of books into one collection not only
masked the positive association between length and price, but produced an illogical
negative association.
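
The reversal in this example can be reproduced from Table 11.3. The sketch below is a rough check rather than the author's calculation: the table does not record which books are hardcover, so the code assumes (hypothetically) that any book priced above $20 is a hardcover and the rest are softcovers. The names `PRICE_CUTOFF`, `pearson_r`, and `pages_vs_price` are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# (pages, price in dollars) transcribed from Table 11.3
books = [
    (104, 32.95), (188, 24.95), (220, 49.95), (264, 79.95), (336, 4.50),
    (342, 49.95), (378, 4.95), (385, 5.99), (417, 4.95), (417, 39.75),
    (436, 5.95), (458, 60.00), (466, 49.95), (469, 5.99), (585, 5.95),
]

# ASSUMPTION: cover type is not listed in the table, so this sketch treats
# books priced above $20 as hardcover and all others as softcover.
PRICE_CUTOFF = 20.00
hardcover = [b for b in books if b[1] > PRICE_CUTOFF]
softcover = [b for b in books if b[1] <= PRICE_CUTOFF]


def pages_vs_price(rows):
    """Correlation between page count and price for the given rows."""
    return pearson_r([pages for pages, _ in rows],
                     [price for _, price in rows])


r_books_all = pages_vs_price(books)
r_hardcover = pages_vs_price(hardcover)
r_softcover = pages_vs_price(softcover)

print(f"All 15 books:   r = {r_books_all:.3f}")
print(f"Hardcover only: r = {r_hardcover:.2f}")
print(f"Softcover only: r = {r_softcover:.2f}")
```

Under that price-based split, the combined correlation is negative while both within-group correlations are positive, in line with the values reported in the example.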


Table 11.3 Pages versus Price for the Books on a Professor’s Shelf


Pages  Price    Pages  Price    Pages  Price
  104  32.95      342  49.95      436   5.95
  188  24.95      378   4.95      458  60.00
  220  49.95      385   5.99      466  49.95
  264  79.95      417   4.95      469   5.99
  336   4.50      417  39.75      585   5.95
Figure 11.4 Combining groups produces misleading correlations (H = hardcover; S = softcover).
Source: Data from Table 11.3.
11.2 Legitimate Correlation Does Not Imply Causation


EXAMPLE 5 


Even if two variables are legitimately related or correlated, do not fall into the trap 
of believing there is a causal connection between them. Although “correlation does 
not imply causation” is a very well known saying among researchers, relationships 
and correlations derived from observational studies are often reported as if the con- 
nection were causal. 

It is easy to construct silly, obvious examples of correlations that do not result from 
causal connections. For example, a list of weekly tissue sales and weekly hot chocolate 
sales for a city with extreme seasons would probably exhibit a correlation because both 
tend to go up in the winter and down in the summer. A list of shoe sizes and vocabulary 
words mastered by school children would certainly exhibit a correlation because older 
children tend to have larger feet and to know more words than younger children. 

The problem is that sometimes the connections do seem to make sense, and it is 
tempting to treat the observed association as if there were a causal link. Remember 
that data from an observational study, in the absence of any other evidence, simply 
cannot be used to establish causation. 


Happiness and Heart Disease 


News Story 4 in the Appendix and on the CD notes that “heart patients who are happy 
are much more likely to be alive 10 years down the road than unhappy heart patients.” 
Does that mean that if you can somehow force yourself to be happy, it will be good for 
your heart? Maybe, but this research is clearly an observational study. People cannot be 
randomly assigned to be happy or not. 
The news story provides some possible explanations for the observed relationship
between happiness and risk of death: 


The experience of joy seems to be a factor. It has physical consequences and also 
attracts other people, making it easier for the patient to receive emotional support. 
Unhappy people, besides suffering from the biochemical effects of their sour 
moods, are also less likely to take their medicines, eat healthy, or to exercise. (p. 9) 


Notice there are several possible confounding variables listed in this quote that may help 
explain why happier people live longer. For instance, it may be the case that whether 
happy or not, people who don’t take their medicine and don’t exercise are likely to die 
sooner, but that unhappy people are more likely to fall into that category. Thus, taking 
one’s medicine and exercising are confounded with the explanatory variable, mood, in 
determining its relationship with the response variable, length of life.


EXAMPLE 6 Prostate Cancer and Red Meat 


The February 1994 issue of the University of California at Berkeley Wellness Letter (pp. 2-3) 
reports the results of a study originally published in the October 1993 issue of the Journal 
of the National Cancer Institute. The study followed 48,000 men who had filled out dietary 
questionnaires in 1986. By 1990, 300 of the men had been diagnosed with prostate can- 
cer and 126 had advanced cases. For the advanced cases, the Wellness Letter reported that 
“men who ate the most red meat had a 164% higher risk than those with the lowest in- 
take. Fats from dairy products, fish, and vegetable oils did not increase the risk” (p. 2). 
We may be tempted to believe this indicates that red meat is a contributing cause 
of prostate cancer. But perhaps there is no causal connection at all. One possibility is that 
a third variable both leads men to consume more red meat and increases the risk of 
prostate cancer. One candidate is the hormone testosterone, which has been implicated 
in the growth of prostate cancer.


11.3 Some Reasons for Relationships Between Variables 


We have seen numerous examples of variables that are related but for which there is 
probably not a causal connection. To help us understand this phenomenon, let’s exam- 
ine some of the reasons two variables could be related, including a causal connection. 


Some reasons two variables could be related: 


1. The explanatory variable is the direct cause of the response variable.
2. The response variable is causing a change in the explanatory variable.
3. The explanatory variable is a contributing but not sole cause of the
   response variable.
4. Confounding variables may exist.
5. Both variables may result from a common cause.
6. Both variables are changing over time.
7. The association may be nothing more than coincidence.
EXAMPLE 7


Reason 1: The explanatory variable is the direct cause of the response variable. 
Occasionally, a change in the explanatory variable is the direct cause of a change in 
the response variable. For example, if we were to measure amount of food consumed 
in the past hour and level of hunger, we would find a relationship. We would proba- 
bly agree that the differences in the amount of food consumed were responsible for 
the difference in levels of hunger. 

Unfortunately, even if one variable is the direct cause of another, we may not see 
a strong association. For example, even though intercourse is the direct cause of 
pregnancy, the relationship between having intercourse and getting pregnant is not 
strong; most occurrences of intercourse do not result in pregnancy. 


Reason 2: The response variable is causing a change in the explanatory vari- 
able. Sometimes the causal connection is the opposite of what might be expected. 
For example, what do you think you would find if you studied hotels and defined the 
response variable as the hotel’s occupancy rate and the explanatory variable as ad- 
vertising sales (in dollars) per room? You would probably expect that higher adver- 
tising expenditures would cause higher occupancy rates. Instead, it turns out that the 
relationship is negative because, when occupancy rates are low, hotels spend more 
money on advertising to try to raise them. Thus, although we might expect higher 
advertising dollars to cause higher occupancy rates, if they are measured at the 
same point in time, we instead find that low occupancy rates cause higher advertising
expenditures.


Reason 3: The explanatory variable is a contributing but not sole cause of the 
response variable. The complex kinds of phenomena most often studied by re- 
searchers are likely to have multiple causes. Even if there were a causal connection 
between diet and a type of cancer, for instance, it would be unlikely that the cancer 
was caused solely by eating that certain type of diet. It is particularly easy to be mis- 
led into thinking you have found a sole cause for a particular outcome, when what 
you have found is actually a necessary contributor to the outcome. For example, sci- 
entists generally agree that in order to have AIDS, you must be infected with HIV. 
In other words, HIV is necessary to develop AIDS. But it does not follow that HIV 
is the sole cause of AIDS, and there has been some controversy over whether that is 
actually the case. 

Another possibility, discussed in earlier chapters, is that one variable is a con- 
tributory cause of another, but only for a subgroup of the population. If the re- 
searchers do not examine separate subgroups, that fact can be masked, as the next 
example demonstrates. 


Delivery Complications, Rejection, and Violent Crime 


A study summarized in Science (Mann, March 1994) and conducted by scientists at the 
University of Southern California reported a relationship between violent crime and 
complications during birth. The researchers found that delivery complications at birth 
were associated with much higher incidence of violent crime later in life. The data came 
from an observational study of males born in Copenhagen, Denmark, between 1959 
and 1961. 
However, the connection held only for those men whose mothers rejected them.
Rejection meant that the mother had not wanted the pregnancy, had tried to have the
fetus aborted, and had sent the baby to an institution for at least a third of his first year
of life. Men who were accepted by their mothers did not exhibit this relationship. Men 
who were rejected by their mothers but for whom there were no complications at birth 
did not exhibit the relationship either. In other words, it was the interaction of delivery 
complications and maternal rejection that was associated with higher levels of violent 
crime. 

This example was based on an observational study, so there may not be a causal link 
at all. However, even if there is a causal connection between delivery complications and 
subsequent violent crime, the data suggest that it holds only for a particular subset of 
the population. If the researchers had not measured the additional variable of maternal 
rejection, the data would have erroneously been interpreted as suggesting that the
connection held for all men.


Reason 4: Confounding variables may exist. We defined confounding vari- 
ables in Chapter 5, but it is worth reviewing the concept here because it is relevant 
for explaining relationships. Remember that a confounding variable is one that 
has two properties. First, a confounding variable is related to the explanatory vari- 
able in the sense that individuals who differ for the explanatory variable are also 
likely to differ for the confounding variable. Second, a confounding variable af- 
fects the response variable. Thus, both the explanatory and one or more con- 
founding variables may help cause the change in the response variable, but there 
is no way to establish how much is due to the explanatory variable and how much 
is due to the confounding variables. Example 5 in this chapter illustrates the point 
with several possibilities for confounding variables. For instance, people with dif- 
fering levels of happiness (the explanatory variable) may have differing levels of 
emotional support, and emotional support affects one’s will to live. Thus, emo- 
tional support is a confounding variable for the relationship between happiness 
and length of life. 


Reason 5: Both variables may result from a common cause. We have seen nu- 
merous examples in which a change in one variable was thought to be associated 
with a change in the other, but for which we speculated that a third variable was re- 
sponsible. For example, Case Study 6.2 concerned the fact that meditators had lev- 
els of an enzyme normally associated with people of a younger age. We could 
speculate that something in the personality of the meditators caused them to want to 
meditate and also caused them to have lower enzyme levels than others of the 
same age. 

As another example, recall the scatterplot and correlation between verbal SAT 
scores and college GPAs, exhibited in Chapters 9 and 10. We would certainly not 
conclude that higher SAT scores caused higher grades in college, except perhaps for 
a slight benefit of boosted self-esteem. However, we could probably agree that the 
causes responsible for one variable being high (or low) are the same as those re- 
sponsible for the other being high (or low). Those causes would include such factors 
as intelligence, motivation, and ability to perform well on tests. 
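
A small simulation can make the common-cause idea concrete. The sketch below is purely illustrative and uses made-up numbers, not real SAT or GPA data: a hypothetical standardized `ability` score feeds into two simulated outcomes, `test_score` and `grades`, and neither outcome has any effect on the other. All variable names and the noise levels are assumptions chosen for the illustration.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the run is repeatable
n_students = 5000

# Hypothetical common cause (standardized units).
ability = [rng.gauss(0, 1) for _ in range(n_students)]

# Each outcome = common cause + its own independent noise.
# Neither outcome appears in the formula for the other.
test_score = [a + rng.gauss(0, 1) for a in ability]
grades = [a + rng.gauss(0, 1) for a in ability]

r_observed = pearson_r(test_score, grades)

# Subtract the common cause's contribution and correlate what is left.
test_leftover = [t - a for t, a in zip(test_score, ability)]
grades_leftover = [g - a for g, a in zip(grades, ability)]
r_after_removing_cause = pearson_r(test_leftover, grades_leftover)

print(f"Correlation between the two outcomes:   {r_observed:.2f}")
print(f"After removing the common cause's part: {r_after_removing_cause:.2f}")
```

With this setup the two outcomes should show a clear positive correlation (about .5 in theory), while the leftover parts should be essentially uncorrelated. In real data the common cause is usually unmeasured, so this subtraction is not available; that is exactly why the association is easy to misread as causal.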
EXAMPLE 8


EXAMPLE 9 


Do Smarter Parents Have Heavier Babies? 

News Story 18 in the Appendix describes a study that found for babies in the normal 
birth weight range, there was a relationship between birth weight and intelligence in 
childhood and early adulthood. The study was based on a cohort of about 3900 babies 
born in Britain in 1946. But there is a genetic component to intelligence, so smarter par- 
ents are likely to have smarter offspring. The researchers did include mother's education 
and father’s social class in the analysis, to rule them out as possible confounding vari- 
ables. However, there are many other variables that may contribute to birth weight, such 
as mother’s diet and alcohol consumption, for which smarter parents may have provided 
more favorable conditions. Thus, it's possible that heavier birth weight and higher
intelligence in the child both result from the common cause of parents’ intelligence.


Reason 6: Both variables are changing over time. Some of the most nonsensi- 
cal associations result from correlating two variables that have both changed over 
time. If they are both changing in a consistent direction, you will indeed see a strong 
correlation, but it may not have any causal link. For example, you would certainly 
see a correlation between winning times in two different Olympic events because 
winning times have all decreased over the years. 

Sociological variables are the ones most likely to be manipulated in this way, as 
demonstrated by the next example, relating increasing divorce rates and increasing 
drug offenses. Watch out for reports of a strong association between two such vari- 
ables, especially when you know that both variables are likely to have had large 
changes over time. 


Divorce Rates and Drug Offenses 

Table 11.4 shows the divorce rates in the United States for various years from 1960 to 
1986, accompanied by the percentage of those admitted to state prisons because of 
drug offenses. The correlation between divorce rate and percentage of criminals admit- 
ted for drug offenses is quite strong, at .67. Based on this correlation, advocates of tra- 
ditional family values may argue that increased divorce rates have resulted in more drug 
offenses. But they both simply reflect a trend across time. The correlation relating year 
and divorce rate is much higher, at .92. Similarly, the correlation between year and
percent admitted for drug offenses is .78. Any two variables that have either both increased
or both decreased across time will display this kind of correlation.
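
The three correlations in this example can be recomputed from Table 11.4. The sketch below also goes one step beyond the text by computing a first-order partial correlation, a standard technique that the example itself does not use, to estimate how much association between divorce rate and drug admissions remains once the shared time trend is held fixed. The function names are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held fixed."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Values transcribed from Table 11.4
years = [1960, 1964, 1970, 1974, 1978, 1982, 1986]
divorce_rate = [2.2, 2.4, 3.5, 4.6, 5.2, 5.1, 4.8]     # per 1000
drug_percent = [4.2, 4.1, 9.8, 12.0, 8.4, 8.1, 16.3]   # % admitted for drug offenses

r_divorce_drug = pearson_r(divorce_rate, drug_percent)
r_year_divorce = pearson_r(years, divorce_rate)
r_year_drug = pearson_r(years, drug_percent)

# Not in the text: association left over after controlling for year.
r_given_year = partial_r(r_divorce_drug, r_year_divorce, r_year_drug)

print(f"divorce rate vs. drug admissions: {r_divorce_drug:.2f}")
print(f"year vs. divorce rate:            {r_year_divorce:.2f}")
print(f"year vs. drug admissions:         {r_year_drug:.2f}")
print(f"divorce vs. drug, year held fixed: {r_given_year:.2f}")
```

The first three values should agree with the .67, .92, and .78 quoted in the example. The partial correlation comes out weak and slightly negative, which is consistent with the claim that the raw correlation mostly reflects the shared trend; with only seven data points, though, it should be read as a rough indication rather than a firm estimate.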


Table 11.4 Divorce Rates and Prison Rates for Drug Offenses


        Divorce Rate    Percentage Admitted
Year    (per 1000)      for Drug Offenses
1960        2.2                 4.2
1964        2.4                 4.1
1970        3.5                 9.8
1974        4.6                12.0
1978        5.2                 8.4
1982        5.1                 8.1
1986        4.8                16.3


Source: Data for divorce rates are from Information Please Almanac, 1991, p. 809. Data for
the drug offenses are from World Almanac and Book of Facts, 1993, p. 950.


Reason 7: The association may be nothing more than coincidence. Sometimes
an association between two variables is nothing more than coincidence, even though
the odds of it happening appear to be very small. For example, suppose a new office
building opened and within a year there was an unusually high rate of brain cancer
among workers in the building. Suppose someone calculated that the odds of having
that many cases in one building were only 1 in 10,000. We might immediately
suspect that something wrong in the environment was causing people to develop brain
cancer.

The problem with this reasoning is that it focuses on the odds of seeing such a
rare event occurring in that particular building in that particular city. It fails to take
into account the fact that there are thousands of new office buildings. If the odds
really were only 1 in 10,000, we should expect to see this phenomenon just by chance
in about 1 of every 10,000 buildings. And that would just be for this particular type
of cancer. What about clusters of other types of cancer or other diseases? It would
be unusual if we did not occasionally see clusters of diseases as chance occurrences.

We will study this phenomenon in more detail in Part 3. For now, be aware that 
a connection of this sort should be expected to occur relatively often, even though 
each individual case has low probability. 
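
The arithmetic behind this argument is short enough to sketch. The code below takes the 1-in-10,000 figure from the text and asks how likely it is that at least one building somewhere shows such a cluster purely by chance. The building counts are hypothetical, and the calculation assumes (as a simplification) that buildings are independent of one another.

```python
def expected_clusters(p_cluster, n_buildings):
    """Expected number of buildings showing a chance cluster."""
    return p_cluster * n_buildings


def prob_at_least_one(p_cluster, n_buildings):
    """Chance that at least one building shows a cluster,
    assuming buildings are independent (a simplifying assumption)."""
    return 1 - (1 - p_cluster) ** n_buildings


p_cluster = 1 / 10_000  # the "1 in 10,000" odds from the text

# Hypothetical numbers of new office buildings
for n_buildings in (1_000, 5_000, 10_000, 50_000):
    expected = expected_clusters(p_cluster, n_buildings)
    chance = prob_at_least_one(p_cluster, n_buildings)
    print(f"{n_buildings:>6} buildings: expect {expected:.1f} clusters, "
          f"P(at least one) = {chance:.2f}")
```

Even with these rough assumptions, the chance of seeing at least one such cluster somewhere grows quickly with the number of buildings, and it would grow further if other diseases were counted as well.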


11.4 Confirming Causation 


Given the number of possible explanations for the relationship between two vari- 
ables, how do we ever establish that there actually is a causal connection? It isn’t 
easy. Ideally, in establishing a causal connection, we would change nothing in the 
environment except the suspected causal variable and then measure the result on the 
suspected outcome variable. 

The only legitimate way to try to establish a causal connection statistically is 
through the use of randomized experiments. As we have discussed earlier, in ran- 
domized experiments we try to rule out confounding variables through random as- 
signment. If we have a large sample, and if we use proper randomization, we can 
assume that the levels of confounding variables will be about equal in the different 
treatment groups. This reduces the chances that an observed association is due to 
confounding variables, even those that we have neglected to measure. 
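
The balancing effect of random assignment can be illustrated with a simulation. The sketch below uses invented data: each simulated subject has an `age` that plays the role of a confounding variable the researcher never measures. It compares the age gap between treatment and control groups under random assignment with the gap under a hypothetical self-selection rule in which older subjects are more inclined to choose the treatment. All numbers and names are assumptions made for the illustration.

```python
import random
from statistics import mean

rng = random.Random(2024)  # fixed seed so the run is repeatable
n_subjects = 10_000

# Hypothetical unmeasured confounder: age in years.
ages = [rng.gauss(50, 10) for _ in range(n_subjects)]

# --- Random assignment: shuffle subjects, first half gets the treatment ---
order = list(range(n_subjects))
rng.shuffle(order)
treated_ids = set(order[: n_subjects // 2])
treated_ages = [ages[i] for i in range(n_subjects) if i in treated_ids]
control_ages = [ages[i] for i in range(n_subjects) if i not in treated_ids]
gap_randomized = mean(treated_ages) - mean(control_ages)

# --- Self-selection: older subjects lean toward choosing the treatment ---
# (hypothetical rule: age plus personal noise must exceed 50)
chose_treatment = [age + rng.gauss(0, 10) > 50 for age in ages]
chooser_ages = [a for a, chose in zip(ages, chose_treatment) if chose]
decliner_ages = [a for a, chose in zip(ages, chose_treatment) if not chose]
gap_self_selected = mean(chooser_ages) - mean(decliner_ages)

print(f"Mean age gap, random assignment: {gap_randomized:+.2f} years")
print(f"Mean age gap, self-selection:    {gap_self_selected:+.2f} years")
```

With a sample this large, random assignment should leave the two groups with nearly identical average ages, even though age was never used in the assignment. Under self-selection the groups differ by several years, so any difference in outcomes could be due to age rather than the treatment.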


Nonstatistical Considerations 


Evidence of a possible causal connection exists when
1. There is a reasonable explanation of cause and effect.
2. The connection happens under varying conditions.
3. Potential confounding variables are ruled out.
If a randomized experiment cannot be done, then nonstatistical considerations must
be used to determine whether a causal link is reasonable. Following are some fea- 
tures that lend evidence to a causal connection: 


1. There is a reasonable explanation of cause and effect. A potential causal con- 
nection will be more believable if an explanation exists for how the cause and effect 
occur. For instance, in Example 4 in this chapter, we established that for hardcover 
books the number of pages is correlated with the price. We would probably not con- 
tend that higher prices result in more pages, but we could reasonably argue that more 
pages result in higher prices. We can imagine that publishers set the price of a book 
based on the cost of producing it and that the more pages there are, the higher the 
cost of production. Thus, we have a reasonable explanation for how an increase in 
the length of a book could cause an increase in the price. 


2. The connection happens under varying conditions. If many observational studies 
conducted under different conditions all find the same link between two variables, 
the evidence for a causal connection is strengthened. This is especially true if the 
studies are not likely to have the same confounding variables. The evidence is also 
strengthened if the same type of relationship holds when the explanatory variable 
falls into different ranges. 

For example, numerous observational studies have related cigarette smoking 
and lung cancer. Further, the studies have shown that the higher the number of cig- 
arettes smoked, the greater the chances of developing lung cancer; similarly, a 
connection has been established between lung cancer and the age at which smok- 
ing began. These facts make it more plausible that smoking actually causes lung 
cancer. 


3. Potential confounding variables are ruled out. When a relationship first appears 
in an observational study, potential confounding variables may immediately come to 
mind. For example, the researchers in Case Study 6.2, showing the relationship be- 
tween meditation and aging, did consider that vegetarian diets and low alcohol con- 
sumption among many of the meditators may have confounded the results. However, 
they were able to locate other research that failed to find any connection between 
these factors and the enzyme they were measuring. The greater the number of con- 
founding factors that can be ruled out, the more convincing the evidence for a causal 
connection. 


A Final Note 


As you should realize by now, it is very difficult to establish a causal connection be- 
tween two variables by using anything except randomized experiments. Because it 
is virtually impossible to conduct a flawless experiment, potential problems crop up 
even with a well-designed experiment. This means that you should look with skep- 
ticism on claims of causal connections. Having read this chapter, you should have 
the tools necessary for making intelligent decisions and for discovering when an er- 
roneous claim is being made. 
Exercises

Asterisked (*) exercises are included in the Solutions at the back of the book.


1. Explain why a strong correlation would be found between weekly sales of firewood
and weekly sales of cough drops over a 1-year period. Would it imply that
fires cause coughs?


*2. Suppose a study of employees at a large company found a negative correlation
between weight and distance walked on an average day. In other words, people
who walked more weighed less. Would you conclude that walking causes lower
weight? Can you think of another potential explanation?


3. An article in Science News (1 June 1996, 149, p. 345) claimed that “evidence
suggests that regular consumption of milk may reduce a person’s risk of stroke,
the third leading cause of death in the United States.” The claim was based on
an observational study of 3150 men, and the article noted that the researchers
“report strong evidence that men who eschew milk have more than twice the
stroke risk of those who drink 1 pint or more daily.” The article concluded by
noting that “those who consumed the most milk tended to be the leanest and the
most physically active.” Go through the list of seven “reasons two variables may
be related,” and discuss each one in the context of this study.


4. Iman (1994, p. 505) presents data on how college students and experts perceive
risks for 30 activities or technologies. Each group ranked the 30 activities. The
rankings for the eight greatest risks, as perceived by the experts, are shown in
Table 11.5.


a. Prepare a scatterplot of the data, with students’ ranks on the vertical axis and 
experts’ ranks on the horizontal axis. 


b. The correlation between the two sets of ranks is .407. Based on your scat- 
terplot in part a, do you think the correlation would increase or decrease if 
X rays were deleted? Explain. What if pesticides were deleted instead? 


c. Another technology listed was nuclear power, ranked first by the students 
and 20th by the experts. If nuclear power was added to the list, do you think 
the correlation between the two sets of rankings would increase or decrease? 
Explain. 


Table 11.5 The Eight Greatest Risks


Activity or Technology    Experts’ Rank    Students’ Rank
Motor vehicles
Smoking
Alcoholic beverages
Handguns
Surgery
Motorcycles
X rays
Pesticides

(Rank values are illegible in the source.)


Source: Iman, 1994, p. 505.
*5. Give an example of two variables that are likely to be correlated because they
are both changing over time.


6. Which one of the seven reasons for relationships listed in Section 11.3 is supposed
to be ruled out by designed experiments?


7. Refer to Case Study 10.2, in which students reported their ideal and actual
weights. When males and females are not separated, the regression equation is


ideal = 8.0 + 0.9 actual


a. Draw the line for this equation and the line for the equation ideal = actual
on the same graph. Comment on the graph as compared to those shown in
Figures 10.7 and 10.8, in terms of how the regression line differs from the
line where ideal and actual weights are the same.


b. Calculate the ideal weight based on the combined regression equation and
the ideal weight based on separate equations, for individuals whose actual
weight is 150 pounds. Recall that the separate equations were


For women: ideal = 43.9 + 0.6 actual
For men: ideal = 52.5 + 0.7 actual


c. Comment on the conclusion you would make about individuals weighing
150 pounds if you used the combined equation compared with the conclusion
you would make if you used the separate equations.


*d. Explain which of the problems identified in this chapter has been uncovered
with this example.


8. Suppose a study measured total beer sales and number of highway deaths for 1
month in various cities. Explain why it would make sense to divide both variables
by the population of the city before determining whether a relationship exists
between them.


9. Construct an example of a situation where an outlier inflates the correlation
between two variables. Draw a scatterplot.


*10. Construct an example of a situation where an outlier deflates the correlation
between two variables. Draw a scatterplot.


11. According to The Wellness Encyclopedia (University of California, 1991, p. 17):
“Alcohol consumed to excess increases the risk of cancer of the mouth, pharynx,
esophagus, and larynx. These risks increase dramatically when alcohol is used
in conjunction with tobacco.” It is obviously not possible to conduct a designed
experiment on humans to test this claim, so the causal conclusion must be based
on observational studies. Explain three potential additional pieces of information
that the authors may have used to lead them to make a causal conclusion.


12. Suppose a positive relationship had been found between each of the following
sets of variables. In Section 11.3, seven potential reasons for such relationships
are given. Explain which of the seven reasons is most likely to account for the
relationship in each case. If you think more than one reason might apply, mention
them all but elaborate on only the one you think is most likely.


a. Number of deaths from automobiles and beer sales for each year from 1950
to 1990.


b. Number of ski accidents and average wait time for the ski lift for each day
during one winter at a ski resort.


c. Stomach cancer and consumption of barbecued foods, which are known to
contain carcinogenic (cancer-causing) substances.


d. Self-reported level of stress and blood pressure.
e. Amount of dietary fat consumed and heart disease.


f. Twice as many cases of leukemia in a new high school, built near a power
plant, than at the old high school.


13. Explain why it would probably be misleading to use correlation to express the
relationship between number of acres burned and number of deaths for major
fires in the United States.


*14. It is said that a higher proportion of drivers of red cars are given tickets for
traffic violations than the drivers of any other color car. Does this mean that if you
drove a red car rather than a white car, you would be more likely to receive a
ticket for a traffic violation? Explain.


15. Construct an example for which correlation between two variables is masked by
grouping over a third variable.


16. An article in the Davis (CA) Enterprise (5 April 1994) had the headline “Study:
Fathers key to child’s success.” The article described a new study as follows:
“The research, published in the March issue of the Journal of Family Psychology,
found that mothers still do a disproportionate share of child care. But surprisingly,
it also found that children who gain the ‘acceptance’ of their fathers
make better grades than those not close to their dads.” The article implies a
causal link, with gaining father’s acceptance (the explanatory variable) resulting
in better grades (the response variable). Choosing from the remaining six
possibilities in Section 11.3 (reasons 2 through 7), give three other potential
explanations for the observed connection.


*17. Lave (1990) discussed studies that had been done to test the usefulness of seat
belts before and after their use became mandatory. One possible method of
testing the usefulness of mandatory seat belt laws is to measure the number of
fatalities in a particular region for the year before and the year after the law
went into effect and to compare them. If such a study were to find substantially
reduced fatalities during the year after the law went into effect, could it
be claimed that the mandatory seat belt law was completely responsible? Explain.
(Hint: Consider factors such as weather and the anticipatory effect of
the law.)


18. In Case Study 10.1, we learned how psychologists relied on twins to measure
the contributions of heredity to various traits. Suppose a study were to find that
identical (monozygotic) twins had highly correlated scores on a certain trait but
that pairs of adult friends did not. Why would that not be sufficient evidence to
conclude that genetic factors were responsible for the trait?
19. An article in The Wichita Eagle (24 June 2003, p. 4A) read as follows:


Scientists have analyzed autopsy brain tissue from members of a religious 
order who had an average of 18 years of formal education and found that 
the more years of schooling, the less likely they were to exhibit Alzheimer’s 
symptoms of dementia. The study provides the first biological proof of edu- 
cation’s possible protective effect. 


Do you agree with the last sentence, that this study provides proof that education has 
a protective effect? Explain. 


For Exercises 20 to 24, refer to the news stories in the Appendix and corresponding 
original source on the CD. In each case, identify which of the “Reasons for Rela- 
tionships between Variables” described in Section 11.3 are likely to apply. 


20. News Story 10: “Churchgoers live longer, study finds.”
21. News Story 12: “Working nights may increase breast cancer risk.”
*22. News Story 15: “Kids’ stress, snacking linked.”
23. News Story 16: “More on TV Violence.”
24. News Story 20: “Eating Organic Foods Reduces Pesticide Concentrations in
Children.”


25. The following are titles of some of the news stories in the Appendix. In each
case, determine whether the study was a randomized experiment or an observational
study, then discuss whether the title is justified based on the way the study
was done.

a. News Story 3: “Rigorous veggie diet found to slash cholesterol.”
b. News Story 8: “Education, kids strengthen marriage.”
c. News Story 10: “Churchgoers live longer, study finds.”
d. News Story 15: “Kids’ stress, snacking linked.”
e. News Story 20: “Eating Organic Foods Reduces Pesticide Concentrations in
Children.”


Mini-Projects 


1. Find a newspaper or journal article that describes an observational study in
which the author’s actual goal is to try to establish a causal connection. Read the
article, and then discuss how well the author has made a case for a causal
connection. Consider the factors discussed in Section 11.4 and discuss whether they
have been addressed by the author. Finally, discuss the extent to which the
author has convinced you that there is a causal connection.


2. Peruse journal articles and find two examples of scatterplots for which the
authors have computed a correlation that you think is misleading. For each case,
explain why you think it is misleading.


Categorical Variables


Thought Questions 


1. Students in a statistics class were asked whether they preferred an in-class or a
take-home final exam and were then categorized as to whether they had received an A
on the midterm. Of the 25 A students, 10 preferred a take-home exam, whereas of
the 50 non-A students, 30 preferred a take-home exam. How would you display
these data in a table?


2. Suppose a news article claimed that drinking coffee doubled your risk of developing
a certain disease. Assume the statistic was based on legitimate, well-conducted
research. What additional information would you want about the risk before deciding
whether to quit drinking coffee? (Hint: Does this statistic provide any information on
your actual risk?)


3. A study to be discussed in detail in this chapter classified pregnant women according
to whether they smoked and whether they were able to get pregnant during the first
cycle in which they tried to do so. What do you think is the question of interest?
Attempt to answer it. Here are the results:

                 Pregnancy Occurred After
             First Cycle   Two or More Cycles   Total
Smoker            29               71            100
Nonsmoker        198              288            486
Total            227              359            586


4. A recent study estimated that the “relative risk” of a woman developing lung cancer
if she smoked was 27.9. What do you think is meant by the term relative risk?


12.1 Displaying Relationships Between Categorical
Variables: Contingency Tables


Summarizing and displaying data resulting from the measurement of two categorical
variables is easy to do: Simply count the number of individuals who fall into each
combination of categories, and present those counts in a table. Such displays are
often called contingency tables because they cover all contingencies for combinations
of the two variables. Each row and column combination in the table is called a cell.

In some cases, one variable can be designated as the explanatory variable and the
other as the response variable. In these cases, it is conventional to place the
explanatory variables down along the side of the table (as labels for the rows) and the
response variables along the top of the table (as labels for the columns). This makes
it easier to display the percentages of interest.


EXAMPLE 1 Aspirin and Heart Attacks


In Case Study 1.2, we discussed an experiment in which there were two categorical 
variables: 


variable A = explanatory variable = aspirin or placebo 
variable B = response variable = heart attack or no heart attack 


Table 12.1 illustrates the contingency table for the results of this study. Notice that the 
explanatory variable (whether the individual took aspirin) is the row variable, whereas 
the response variable (whether the person had a heart attack) is the column variable. 
There are four cells, one representing each combination of treatment and outcome. ■


Conditional Percentages and Rates 


It’s difficult to make useful comparisons from a contingency table (unless the num- 
ber of individuals under each condition is the same) without doing further calcula- 
tions. Usually, the question of interest is whether the percentages in each category of 
the response variable change when the explanatory variable changes. 

Table 12.1 Heart Attack Rates After Taking Aspirin or Placebo

           Heart Attack   No Heart Attack    Total
Aspirin         104           10,933         11,037
Placebo         189           10,845         11,034
Total           293           21,778         22,071


In Example 1, the question of interest is whether the percentage of heart attack
sufferers differs for the people who took aspirin and the people who took a placebo.
In other words, is the percentage of people who fall into the first column (heart
attack) the same for the two rows? We can calculate the conditional percentages for
the response variable by looking separately at each category of the explanatory
variable. Thus, in our example, we have two conditional percentages:


Aspirin group: The percentage who had heart attacks was 104/11,037 =
0.0094 = 0.94%.

Placebo group: The percentage who had heart attacks was 189/11,034 =
0.0171 = 1.71%.


Sometimes, for rare events like these heart attack numbers, percentages are so small
that it is easier to interpret a rate. The rate is simply stated as the number of
individuals per 1000 or per 10,000 or per 100,000, depending on what’s easiest to
interpret. Percentage is equivalent to a rate per 100.

Table 12.2 presents the data from Example 1, but also includes the conditional
percentages and the rates of heart attacks per 1000 individuals for the two groups.
Notice that the rate per 1000 is easier to understand than the percentages.


EXAMPLE 2 Young Drivers, Gender, and Driving Under the Influence of Alcohol


In Case Study 6.3, we learned about a court case challenging an Oklahoma law that dif- 
ferentiated the ages at which young men and women could buy 3.2% beer. The 
Supreme Court had examined evidence from a “random roadside survey” that mea- 
sured information on age, gender, and drinking behavior. In addition to the data pre- 
sented in Case Study 6.3, the roadside survey measured whether the driver had been 
drinking alcohol in the previous 2 hours. Table 12.3 gives the results for the drivers un- 
der 20 years of age. 

The Supreme Court concluded that “the showing offered by the appellees does not 
satisfy us that sex represents a legitimate, accurate proxy for the regulation of drinking 
and driving” (Gastwirth, 1988, p. 527). Notice the difference in the percentages of young 
men and women who had been drinking alcohol, with the percentage slightly higher for 
males. However, in the next chapter we will see that we cannot rule out chance as a rea- 
sonable explanation for this difference. In other words, if there really is no difference 
among the percentages of young male and female drivers in the population who drink 
and drive, we still could reasonably expect to see a difference as large as the one observed 
in a sample of this size. Using the language introduced in Chapter 10, this means that 
the observed difference in percentages is not statistically significant. ■


Table 12.2 Data for Example 1 with Percentage and Rate Added

           Heart     No Heart              Heart          Rate per
           Attack    Attack      Total     Attacks (%)    1000
Aspirin     104      10,933     11,037       0.94            9.4
Placebo     189      10,845     11,034       1.71           17.1
Total       293      21,778     22,071
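
As a rough illustration, the percentages and rates in Table 12.2 can be reproduced from the counts in Table 12.1 with a few lines of Python. The function name `conditional_rate` and the dictionary layout are our own choices for this sketch; they are not part of the original study.

```python
# Counts from Table 12.1: (heart attacks, group total) for each treatment.
counts = {
    "Aspirin": (104, 11037),
    "Placebo": (189, 11034),
}


def conditional_rate(events, total, per=100):
    """Return the number of events per `per` individuals in one group."""
    return events / total * per


for group, (attacks, total) in counts.items():
    percent = conditional_rate(attacks, total, per=100)
    per_1000 = conditional_rate(attacks, total, per=1000)
    print(f"{group}: {percent:.2f}% had heart attacks, or {per_1000:.1f} per 1000")
```

Setting `per=100` gives the conditional percentage, and `per=1000` gives the rate per 1000, so the two columns of Table 12.2 differ only in the multiplier.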


Table 12.3 Results of Roadside Survey for Young Drivers

              Drank Alcohol in Last 2 Hours?
           Yes     No     Total    Percentage Who Drank
Males       77    404      481           16.0%
Females     16    122      138           11.6%
Total       93    526      619           15.0%

Source: Gastwirth, 1988, p. 526.


EXAMPLE 3 Ease of Pregnancy for Smokers and Nonsmokers


In a retrospective observational study, researchers asked women who were pregnant 
with planned pregnancies how long it took them to get pregnant (Baird and Wilcox, 
1985; see also Weiden and Gladen, 1986). Length of time to pregnancy was measured 
according to the number of cycles between stopping birth control and getting pregnant. 
Women were also categorized on whether they smoked, with smoking defined as hav- 
ing at least one cigarette per day for at least the first cycle during which they were try- 
ing to get pregnant. 
For our purposes, we will classify the women on two categorical variables: 


variable A = explanatory variable = smoker or nonsmoker 
variable B = response variable = pregnant in first cycle or not 


The question of interest is whether the same percentages of smokers and nonsmokers 
were able to get pregnant during the first cycle. We present the contingency table and 
the percentages in Table 12.4. 

As you can see, a much higher percentage of nonsmokers than smokers were able 
to get pregnant during the first cycle. Because this is an observational study, we cannot 
conclude that smoking caused a delay in getting pregnant. We merely notice that there 
is a relationship between smoking status and time to pregnancy, at least for this sample. 
It is not difficult to think of potential confounding variables. ■


Table 12.4 Time to Pregnancy for Smokers and Nonsmokers

                   Pregnancy Occurred After
                              Two or More               Percentage in
             First Cycle      Cycles         Total      First Cycle
Smoker            29             71           100           29%
Nonsmoker        198            288           486           41%
Total            227            359           586


12.2 Relative Risk, Increased Risk, and Odds


Various measures are used to report the chances of a particular outcome and how the 
chances increase or decrease with changes in an explanatory variable. Here are some 
quotes that use different measures to report chance: 


■ “What they found was that women who smoked had a risk [of getting lung cancer]
  27.9 times as great as nonsmoking women; in contrast, the risk for men who
  smoked regularly was only 9.6 times greater than that for male nonsmokers.”
  (Taubes, 26 November 1993, p. 1375)

■ “Clinically depressed people are at a 50 percent greater risk of killing themselves.”
  (Newsweek, 18 April 1994, p. 48)

■ “On average, the odds against a high school player playing NCAA football are 25
  to 1. But even if he’s made his college team, his odds are a slim 30 to 1 against
  being chosen in the NFL draft.” (Krantz, 1992, p. 107)


Risk, Probability, and Odds 


There are just two basic ways to express the chances that a randomly selected indi- 
vidual will fall into a particular category for a categorical variable. The first of the 
two methods involves expressing one category as a proportion of the total; the 
other involves comparing one category to another category in the form of relative 
odds. 

Suppose a population contains 1000 individuals, of which 400 carry the gene for 
a disease. The following are all equivalent ways to express this proportion: 


■ Forty percent (40%) of all individuals carry the gene.
■ The proportion who carry the gene is 0.40.
■ The probability that someone carries the gene is 0.40.
■ The risk of carrying the gene is 0.40.


However, to express this in odds requires a different calculation. The equivalent 
statement represented in odds would be: 


The odds of carrying the gene are 4 to 6 (or 2 to 3, or 2/3 to 1). 


The odds are usually expressed by reducing the numbers with and without 
the trait to the smallest whole numbers possible. Thus, we would say that the odds 
are 2 to 3, rather than saying they are 2/3 to 1. Both formulations would be 
correct. 

The general forms of these expressions are as follows:

$$\text{Percentage with the trait} = \frac{\text{number with trait}}{\text{total}} \times 100\%$$

$$\text{Proportion with the trait} = \frac{\text{number with trait}}{\text{total}}$$

$$\text{Probability of having the trait} = \frac{\text{number with trait}}{\text{total}}$$

$$\text{Risk of having the trait} = \frac{\text{number with trait}}{\text{total}}$$

$$\text{Odds of having the trait} = \frac{\text{number with trait}}{\text{number without trait}} \text{ to } 1$$


Calculating the odds from the proportion and vice versa is a simple operation. 


If p is the proportion who have the trait, then the odds of having it are
p/(1 − p) to 1.


If the odds of having the trait are a to b, then the proportion who have it is 
a/(a + b). 


For example, if the proportion carrying a certain gene is 0.4, then the odds of hav- 
ing it are (0.4/0.6) to 1, or 2/3 to 1, or 2 to 3. Going in the other direction, if the 
odds of having it are 2 to 3, then the proportion who have it is 2/(2 + 3) = 2/5 = 
4/10 = 0.40. 
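
The two conversions can be sketched in Python as follows. The function names are our own, and exact fractions are used only so that the odds reduce to the smallest whole numbers, as described earlier in this section.

```python
from fractions import Fraction


def proportion_to_odds(p):
    """Odds of having the trait, expressed as 'x to 1', given proportion p."""
    return p / (1 - p)


def odds_to_proportion(a, b):
    """Proportion with the trait when the odds of having it are a to b."""
    return a / (a + b)


p = Fraction(4, 10)                 # proportion carrying the gene
odds = proportion_to_odds(p)        # Fraction(2, 3), that is, 2/3 to 1
print(f"odds: {odds.numerator} to {odds.denominator}")
print(f"proportion: {odds_to_proportion(2, 3):.2f}")
```

Because `Fraction` reduces automatically, the numerator and denominator of the result give the odds in the conventional "2 to 3" form.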


Baseline Risk 


When there is a treatment or behavior for which researchers want to study risk, they 
often compare it to the baseline risk, which is the risk without the treatment or be- 
havior. For instance, in determining whether aspirin helps prevent heart attacks, the 
baseline risk is the risk of having a heart attack without taking aspirin. In studying 
the risk of smoking and getting lung cancer, the baseline risk is the risk of getting 
lung cancer without smoking. 

In practice, the baseline risk can be difficult to find. When researchers include a 
placebo as a treatment, the risk for the group taking the placebo is utilized as the base- 
line risk. Of course, the baseline risk depends on the population being studied as well. 
For instance, the risk of having a heart attack without taking daily aspirin differs for 
men and women, for people in different age groups, for differing levels of exercise, and 
so on. That’s why it’s important to include a placebo or a control group as similar as 
possible to the treatment group in studies assessing the risk of traits or behaviors. 


Relative Risk


The relative risk of an outcome for two categories of an explanatory variable is
simply the ratio of the risks for each category. The relative risk is often expressed
as a multiple. For example, a relative risk of 3 may be reported by saying that the risk
of developing a disease for one group is three times what it is for another group.
Notice that a relative risk of 1 would mean that the risk is the same for both categories
of the explanatory variable.

It is often of interest to compare the risk of disease for those with a certain trait
or behavior to the baseline risk of that disease. In that case, the relative risk usually
is a ratio, with the risk for the trait of interest in the numerator and the baseline risk
in the denominator. However, there is no hard-and-fast rule, and if the trait or
behavior decreases the risk, as taking aspirin appears to do for heart attacks, the
baseline risk is often used as the numerator. In general, relative risks greater than 1 are
easier to interpret than relative risks between 0 and 1.


EXAMPLE 4 Relative Risk of Developing Breast Cancer
Pagano and Gauvreau (1993, p. 133) reported data for women participating in the first 
National Health and Nutrition Examination Survey (Carter, Jones, Schatzkin, and Brinton, 
January-February, 1989). The explanatory variable was whether the age at which a 
woman gave birth to her first child was 25 or older, and the outcome variable was 
whether she developed breast cancer (see Table 12.5). 

To compute the relative risk of developing breast cancer based on whether the age 
at which a woman had her first child was 25 or older, we first find the risk of breast can- 
cer for each group: 


■ Risk for women having first child at age 25 or older = 31/1628 = 0.0190
■ Risk for women having first child before age 25 = 65/4540 = 0.0143
■ Relative risk = 0.0190/0.0143 = 1.33

We can also represent this by saying that the risk of developing breast cancer is 1.33
times as high for women who had their first child at age 25 or older as for those who
did not. ■
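
The calculation in Example 4 can be sketched in Python using the counts from Table 12.5. The helper names `risk` and `relative_risk` are ours, and the second ratio simply swaps numerator and denominator to show the relative risk in the other direction.

```python
def risk(events, total):
    """Proportion of a group that experienced the outcome."""
    return events / total


def relative_risk(risk_a, risk_b):
    """Ratio of two risks, with risk_b in the denominator."""
    return risk_a / risk_b


# Counts from Table 12.5: (breast cancer cases, group total).
risk_25_or_older = risk(31, 1628)
risk_before_25 = risk(65, 4540)

rr = relative_risk(risk_25_or_older, risk_before_25)
rr_reversed = relative_risk(risk_before_25, risk_25_or_older)

print(f"risk, first child at 25 or older: {risk_25_or_older:.4f}")
print(f"risk, first child before 25:      {risk_before_25:.4f}")
print(f"relative risk:                    {rr:.2f}")
print(f"relative risk, other direction:   {rr_reversed:.2f}")
```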


Notice the direction for which the relative risk was calculated, which was to put
the group with lower risk in the denominator. As noted, this is common practice
because it’s easier for most people to interpret the results in that direction. For the
current example, the relative risk in the other direction would be 0.75. In other words,
the risk of developing breast cancer is 0.75 as much for women who have had their
first child before age 25 as it is for women who have not. You can see that this
relative risk statistic is more difficult to read than the relative risk of 1.33 presented in
the example. In this case, the risk of breast cancer for women who have their first
child before they are 25 can also be considered as the baseline risk. Waiting until age
25 or later increases the risk from that baseline, and it makes sense to present the
relative risk in that direction.


Table 12.5 Age at Birth of First Child and Breast Cancer

First Child at Age 25 or Older?   Breast Cancer   No Breast Cancer   Total
Yes                                    31              1597          1628
No                                     65              4475          4540
Total                                  96              6072          6168

Source: Pagano and Gauvreau, 1993.


Increased Risk


Sometimes the change in risk is presented as a percentage increase instead of a
multiple. The percent increase in risk is calculated as follows:

$$\text{increased risk} = \frac{\text{change in risk}}{\text{baseline risk}} \times 100\%$$

An equivalent way to compute the increased risk is

$$\text{increased risk} = (\text{relative risk} - 1.0) \times 100\%$$

If there is no obvious baseline risk, then the denominator for the increased risk is
whatever was used as the denominator for the corresponding relative risk.


EXAMPLE 5 Increased Risk of Breast Cancer

The change in risk of breast cancer for women who have not had their first child before
age 25 compared with those who have is (0.0190 − 0.0143) = 0.0047. Because the
baseline risk for those who have had a child before age 25 is 0.0143, this change
represents an increase of (0.0047/0.0143) = 0.329 = 32.9%, or about 33%. The increased
risk would be reported by saying that there is a 33% increase in the chances of breast
cancer for women who have not had a child before the age of 25. Notice that this is
also (relative risk − 1.0) × 100%. ■
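
As a check on the two equivalent formulas for increased risk, here is a minimal Python sketch using the counts from Table 12.5. The function names are our own. Because the sketch works with unrounded risks, it gives 33.0% rather than the 32.9% obtained above from the rounded values 0.0190 and 0.0143.

```python
def increased_risk(risk_group, baseline_risk):
    """Percentage increase in risk relative to the baseline risk."""
    return (risk_group - baseline_risk) / baseline_risk * 100


def increased_risk_from_rr(relative_risk):
    """Equivalent form: (relative risk - 1.0) x 100%."""
    return (relative_risk - 1.0) * 100


baseline = 65 / 4540   # first child before age 25 (Table 12.5)
exposed = 31 / 1628    # first child at age 25 or older

print(f"{increased_risk(exposed, baseline):.1f}%")
print(f"{increased_risk_from_rr(exposed / baseline):.1f}%")
```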


Odds Ratio 


Epidemiologists, who study the causes and progression of diseases and other health 
risks, often represent comparative risks using the odds ratio instead of the relative 
risk. If the risk of a disease is small, these two measures will be about the same. The 
relative risk is easier to understand, but the odds ratio is easier to work with statisti- 
cally. Therefore, you will often find the odds ratio reported in journal articles about 
health-related issues. 

To compute the odds ratio, you first compute the odds of getting the disease to 
not getting the disease for each of the two categories of the explanatory variable. You 
then take the ratio of those odds. Let’s compute it for the example concerning the 
risk of breast cancer. 


■ Odds for women having first child at age 25 or older = 31/1597 = 0.0194
■ Odds for women having first child before age 25 = 65/4475 = 0.0145
■ Odds ratio = 0.0194/0.0145 = 1.34

You can see that the odds ratio of 1.34 is very similar to the relative risk of 1.33. As
we have noted, this will be the case as long as the risk of disease in each category is 
small. 

There is an easier way to compute the odds ratio, but if you were to simply see 
the formula you might not understand that it was a ratio of the two odds. The for- 
mula proceeds as follows: 


1. Multiply the two numbers in the upper left and lower right cells of the table.
2. Divide the result by the product of the numbers in the upper right and lower
   left cells of the table.

For the example we have been studying (Table 12.5), the computation would be as
follows:

$$\text{odds ratio} = \frac{31 \times 4475}{1597 \times 65} = 1.34$$


Depending on how your table is constructed, you might have to reverse the numer- 
ator and denominator. As with relative risk, it is conventional to construct the odds 
ratio so that it is greater than 1. The only difference is which category of the ex- 
planatory variable gets counted as the numerator of the ratio and which gets counted 
as the denominator. 
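
To see that the shortcut really is a ratio of two odds, both versions can be computed side by side. This is a sketch with our own function names, where `a`, `b`, `c`, and `d` stand for the upper left, upper right, lower left, and lower right cells of Table 12.5.

```python
def odds_ratio_from_odds(a, b, c, d):
    """Ratio of the odds in row 1 (a to b) to the odds in row 2 (c to d)."""
    return (a / b) / (c / d)


def odds_ratio_cross_product(a, b, c, d):
    """Shortcut: (upper left x lower right) / (upper right x lower left)."""
    return (a * d) / (b * c)


# Table 12.5 as rows of (breast cancer, no breast cancer).
a, b = 31, 1597    # first child at age 25 or older
c, d = 65, 4475    # first child before age 25

print(f"{odds_ratio_from_odds(a, b, c, d):.2f}")
print(f"{odds_ratio_cross_product(a, b, c, d):.2f}")
# Swapping the rows reverses numerator and denominator, giving the reciprocal.
print(f"{odds_ratio_cross_product(c, d, a, b):.2f}")
```

The first two results agree, and the third shows what happens if the table is constructed with the rows in the opposite order.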


Relative Risk and Odds Ratios in Journal Articles 


Journal articles often report relative risk and odds ratios, but rarely in the simple 
form described here. In most studies, researchers measure a number of potential con- 
founding variables. When they compute the relative risk or odds ratio, they “adjust” 
it to account for these confounding variables. For instance, they might report the rel- 
ative risk for getting a certain type of cancer if you eat a high-fat diet, after taking 
into account age, amount of exercise, and whether or not people smoke. 

The statistical methods used for these adjustments are beyond the level of this 
book, but interpreting the results is not. An adjusted relative risk or odds ratio has 
almost the same interpretation as the straightforward versions we have learned. The 
only difference is that you can think of them as applying to two groups for which the 
other variables are held approximately constant. For instance, suppose the relative
risk for getting a certain type of cancer for those with a high-fat diet compared with
those with a low-fat diet is reported to be 1.3, adjusted for age and smoking status.
That means that the relative risk applies (approximately) to two groups of individuals
of the same age and smoking status, where one group has a high-fat diet and the other
has a low-fat diet.


EXAMPLE 6 Night Shift Work and Odds for Breast Cancer 


News Story 12, “Working nights may increase breast cancer risk,” reported that “women
who regularly worked night shifts for three years or less were about 40 percent more
likely to have breast cancer than women who did not work such shifts.” Consulting the
journal article in which this statistic is found reveals that it is not a simple increased risk
as defined in this chapter. The result is found in Table 3 of the journal article “Night Shift
Work, Light at Night, and Risk of Breast Cancer,” in a column headed “Odds ratio.” It’s
actually an adjusted odds ratio, for women who worked at least one night shift a week
for up to 3 years. The footnote to the table explains that “odds ratios were adjusted for
parity [number of pregnancies], family history of breast cancer (mother or sister), oral
contraceptive use (ever), and recent (<5 years) discontinued use of hormone replacement
therapy” (p. 1561). Also, notice that the news story reports an increased risk of
40 percent, which would be correct if the relative risk were 1.4. Remember that for most
diseases the odds ratio and relative risk are very similar. The news report is based on the
assumption that this is the case for this study. ■
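
How safe that assumption is depends on how common the outcome is. The sketch below compares the unadjusted relative risk and odds ratio for a rare outcome (breast cancer, Table 12.5) and a common one (pregnancy in the first cycle, Table 12.4). The helper name `rr_and_or` is ours, and the sketch does not attempt to reproduce the adjusted odds ratio from the journal article.

```python
def rr_and_or(events_1, total_1, events_2, total_2):
    """Return (relative risk, odds ratio) for group 1 versus group 2."""
    risk_1, risk_2 = events_1 / total_1, events_2 / total_2
    odds_1 = events_1 / (total_1 - events_1)
    odds_2 = events_2 / (total_2 - events_2)
    return risk_1 / risk_2, odds_1 / odds_2


# Rare outcome: breast cancer (Table 12.5), risks below 2%.
rare = rr_and_or(31, 1628, 65, 4540)

# Common outcome: pregnancy in the first cycle (Table 12.4),
# nonsmokers versus smokers, proportions of roughly 41% and 29%.
common = rr_and_or(198, 486, 29, 100)

print(f"rare outcome:   RR = {rare[0]:.2f}, OR = {rare[1]:.2f}")
print(f"common outcome: RR = {common[0]:.2f}, OR = {common[1]:.2f}")
```

For the rare outcome the two measures nearly coincide, whereas for the common outcome the odds ratio is noticeably larger than the relative risk, so reading an odds ratio as if it were a relative risk would overstate the difference.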


12.3 Misleading Statistics about Risk 


You can be misled in a number of ways by statistics presenting risks. Unfortunately, 
statistics are often presented in the way that produces the best story rather than in the 
way that is most informative. Often, you cannot derive the information you need 
from news reports. 


Common ways the media misrepresent statistics about risk: 
1. The baseline risk is missing. 

2. The time period of the risk is not identified. 

3. The reported risk is not necessarily your risk. 


Missing Baseline Risk 


A study appeared on the front page of the Sacramento Bee on March 8, 1984 with 
the headline “Evidence of new cancer-beer connection” (p. A1). The article reported
that men who drank 500 ounces or more of beer a month (about 16 ounces a day) 
were three times more likely to develop cancer of the rectum than nondrinkers. If you 
were a beer drinker reading about this study, would it encourage you to reduce your 
beer consumption? 

Although a relative risk of three times sounds ominous, it is not much help in 
making lifestyle decisions without also having information about what the risk is 
without drinking beer, the baseline risk. If a threefold risk increase means that your 
chances of developing this cancer go from 1 in 100,000 to 3 in 100,000, you are 
much less likely to be concerned than if it means your risk jumps from 1 in 10 to 3 
in 10. When a study reports relative risk, it should always give you a baseline risk 
as well. 

In fairness, this article did report an estimate from the American Cancer Society 
that there are about 40,000 new cases of rectal cancer in the United States each year. 
Further, it gave enough information so that one could derive the fact that there were 
about 3600 non-beer drinkers in the study and 20 of them developed this cancer, for
a baseline risk of about 0.0056, or about 1 in 180. Therefore, we can surmise that
among those who drank more than 500 ounces of beer a month, the risk was about 
3 in 180, or 1 in 60. Remember that because these results were based on an obser- 
vational study, we cannot conclude that drinking beer actually caused the greater ob- 
served risk. For instance, there may be confounding dietary factors. 
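
The point that a relative risk means little without a baseline can be made concrete with a short Python sketch. The function names are our own. The first two baselines are the hypothetical values used above, and the third is the baseline derived from the study.

```python
def risk_under_relative_risk(baseline_risk, relative_risk):
    """Risk implied by applying a relative risk to a baseline risk."""
    return baseline_risk * relative_risk


def as_one_in(risk):
    """Express a risk as '1 in N', rounding N to the nearest whole number."""
    return f"1 in {round(1 / risk):,}"


# Baseline risks: two hypothetical values from the text and the study's own.
baselines = {
    "1 in 100,000 (hypothetical)": 1 / 100_000,
    "1 in 10 (hypothetical)": 1 / 10,
    "20 of 3600 nondrinkers (study)": 20 / 3600,
}

for label, baseline in baselines.items():
    tripled = risk_under_relative_risk(baseline, 3)
    print(f"baseline {label}: tripled risk is about {as_one_in(tripled)}")
```

The same relative risk of 3 corresponds to very different absolute risks, which is why the baseline matters for any personal decision.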


Risk over What Time Period? 


“Italian scientists report that a diet rich in animal protein and fat—cheeseburgers, 
french fries, and ice cream, for example—increases a woman’s risk of breast cancer 
threefold,” according to Prevention Magazine’s Giant Book of Health Facts (1991, 
p. 122). Couple this with the fact that the American Cancer Society estimates that 1 
in 9 women in the United States will get breast cancer. Does that mean that if a 
woman eats a diet rich in animal protein and fat, her chances of developing breast 
cancer are 1 in 3?

There are two problems with this line of reasoning. First, the statement attributed 
to the Italian scientists was woefully incomplete. It did not specify anything about 
how the study was conducted. It also did not specify the ages of the women studied 
or what the baseline rate of breast cancer was for the study. Why would we need to 
know the baseline rate for the study when we already know that 1 in 9 women will
develop breast cancer? The answer is that age is a critical factor. The baseline rate of 
1 in 9 is a lifetime risk, at least to age 85. As with most diseases, accumulated risk 
increases with age. 

According to the University of California at Berkeley Wellness Letter (July 1992, 
p. 1), the lifetime risk of a woman developing breast cancer by certain ages is 


by age 50: 1 in 50
by age 60: 1 in 23
by age 70: 1 in 13
by age 80: 1 in 10
by age 85: 1 in 9

The annual risk of developing breast cancer is only about 1 in 3700 for women in
their early 30s but is 1 in 235 for women in their early 70s (Fletcher, Black, Harris,
Rimer, and Shapiro, 20 October 1993, p. 1644). If the Italian study had been done
on very young women, the threefold increase in risk could represent a small
increase. Unfortunately, Prevention Magazine’s Giant Book of Health Facts did not
give even enough information to lead us to the original report of the work.
Therefore, it is impossible to intelligently evaluate the claim.
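
To see how much the age of the women studied could matter, the following sketch applies a threefold relative risk to the two annual risks quoted above. This is purely an illustration: we are assuming that the threefold figure could be applied to annual risk at each age, which the source does not tell us, and the names `RELATIVE_RISK` and `per_10000` are our own.

```python
# Annual risks of breast cancer quoted in the text (Fletcher et al., 1993).
annual_risk = {
    "early 30s": 1 / 3700,
    "early 70s": 1 / 235,
}

RELATIVE_RISK = 3  # assumption: the "threefold" increase applies at every age


def per_10000(risk):
    """Express a risk as the expected number of cases per 10,000 women."""
    return risk * 10_000


for age_group, baseline in annual_risk.items():
    raised = baseline * RELATIVE_RISK
    extra = per_10000(raised) - per_10000(baseline)
    print(f"{age_group}: {per_10000(baseline):.1f} -> {per_10000(raised):.1f} "
          f"per 10,000 per year ({extra:.1f} extra cases)")
```

Under this assumption, tripling the risk adds only a handful of cases per 10,000 women per year in the younger group but many more in the older group, even though the relative risk is identical.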


Reported Risk versus Your Risk 


The headline was enough to make you want to go out and buy a new car: “Older cars 
stolen more often than new ones” [Davis (CA) Enterprise, 15 April 1994, p. C3]. 
The article reported that “among the 20 most popular auto models stolen [in
California] last year, 17 were at least 10 years old.”

Suppose you own two cars; one is 15 years old and the other is new. You park
them both on the street outside of your home. Are you at greater risk of having the 
old one stolen? Perhaps, but the information quoted in the article gives you no in- 
formation about that question. 

Numerous factors determine which cars are stolen. We can easily speculate that 
many of those factors are strongly related to the age of cars as well. Certain neigh- 
borhoods are more likely to be targeted than others, and those same neighborhoods 
are probably more likely to have older cars parked in them. Cars parked in locked 
garages are less likely to be stolen and are more likely to be newer cars. Cars with 
easily opened doors are more likely to be stolen and more likely to be old. Cars that 
are not locked and/or don’t have alarm systems are more likely to be stolen and are 
more likely to be old. Cars with high value for used parts are more likely to be stolen 
and are more likely to be old, discontinued models. 

You can see that the real question of interest to a consumer is, “If I were to buy 
a new car, would my chances of having it stolen increase or decrease over those of 
the car I own now?” That question can’t be answered based only on information 
about which cars have been stolen most often. Simply too many variables are related 
to both the age of the car and its risk of being stolen. 


12.4 Simpson’s Paradox: The Missing Third Variable 


In Chapter 11, we saw an example where omitting a third variable masked the
positive correlation between number of pages and price of books. A similar
phenomenon can happen with categorical variables, and it goes by the name of Simpson’s
Paradox. It is a paradox because the relationship appears to be in one direction if
the third variable is not considered and in the other direction if it is.


EXAMPLE 7 Simpson’s Paradox for Hospital Patients


We illustrate Simpson’s Paradox with a hypothetical example of a new treatment for a 
disease. Suppose two hospitals are willing to participate in an experiment to test the 
new treatment. Hospital A is a major research facility, famous for its treatment of ad- 
vanced cases of the disease. Hospital B is a local area hospital in an urban area. 

Both hospitals agree to include 1100 patients in the study. Because the researchers 
conducting the experiment are on the staff of Hospital A, they decide to perform the 
majority of cases with the new procedure in-house. They randomly assign 1000 patients 
to the new treatment, with the remaining 100 receiving the standard treatment. Hospi- 
tal B, which is a bit reluctant to try something new on too many patients, agrees to ran- 
domly assign 100 patients to the new treatment, leaving 1000 to receive the standard. 
The numbers who survived and died with each treatment in each hospital are shown in 
Table 12.6. 

Table 12.6 Survival Rates for Standard and New Treatments at Two Hospitals

                    Hospital A                  Hospital B
             Survive    Die    Total     Survive    Die    Total
Standard         5       95     100        500      500    1000
New            100      900    1000         95        5     100
Total          105      995    1100        595      505    1100


Table 12.7 shows how well the new procedure worked compared with the standard.
It looks as though the new treatment is a success. The risk of dying from the standard
procedure is higher than that for the new procedure in both hospitals. In fact, the
risk of dying when given the standard treatment is an overwhelming 10 times higher
than it is for the new treatment in Hospital B. In Hospital A the risk of dying with the
standard treatment is only 1.06 times higher than with the new treatment, but it is
nonetheless higher.


Table 12.7 Risk Compared for Standard and New Treatments

                                           Hospital A           Hospital B
Risk of dying with the standard treatment  95/100 = 0.95        500/1000 = 0.50
Risk of dying with the new treatment       900/1000 = 0.90      5/100 = 0.05
Relative risk                              0.95/0.90 = 1.06     0.50/0.05 = 10.0

The researchers would now like to estimate the overall reduction in risk for the new
treatment, so they combine all of the data (Table 12.8).


Table 12.8 Estimating the Overall Reduction in Risk

            Survive    Die    Total    Risk of Death
Standard        505    595     1100    595/1100 = 0.54
New             195    905     1100    905/1100 = 0.82
Total           700   1500     2200


What has gone wrong? It now looks as though the standard treatment is superior
to the new one! In fact, the relative risk, taken in the same direction as before, is
0.54/0.82 = 0.66. The death rate for the standard treatment is only 66% of what it is
for the new treatment. How can that be true, when the death rate for the standard
treatment was higher than for the new treatment in both hospitals?

The problem is that the more serious cases of the disease presumably were treated
by the famous research hospital, Hospital A. Because they were more serious cases, they
were more likely to die. But because they went to Hospital A, they were also more likely
to receive the new treatment. When the results from both hospitals are combined, we
lose the information that the patients in Hospital A had both a higher overall death rate
and a higher likelihood of receiving the new treatment. The combined information is
quite misleading.


Simpson’s Paradox makes it clear that it is dangerous to summarize information
over groups, especially if patients (or experimental units) were not randomized into
the groups. Notice that if patients had been randomly assigned to the two hospitals,
this phenomenon probably would not have occurred. It would have been unethical,
however, to do such a random assignment.

If someone else has already summarized data for you by collapsing three variables
into two, you cannot retrieve the information to see whether Simpson’s Paradox
has occurred. Common sense should help you detect this problem in some cases.
When you read about a relationship between two categorical variables, try to find out 
if the data have been collapsed over a third variable. If so, think about whether sep- 
arating results for the different categories of the third variable could change the di- 
rection of the relationship between the first two. Exercise 14 at the end of this 
chapter presents an example of this. 
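
The reversal in the hospital example can be checked with a few lines of arithmetic. The following is a minimal sketch in Python (the variable names are illustrative assumptions, not part of the original example); it uses the counts from Table 12.6, computes the relative risk within each hospital, and then recomputes it after collapsing over hospital.

```python
# Counts from Table 12.6, stored as (deaths, patients) for each
# hospital and treatment.
counts = {
    "Hospital A": {"standard": (95, 100), "new": (900, 1000)},
    "Hospital B": {"standard": (500, 1000), "new": (5, 100)},
}


def risk(deaths, total):
    """Proportion of patients who died."""
    return deaths / total


# Relative risk (standard compared with new) within each hospital.
within = {}
for hospital, arms in counts.items():
    within[hospital] = risk(*arms["standard"]) / risk(*arms["new"])
    print(f"{hospital}: relative risk = {within[hospital]:.2f}")

# Collapse over hospital: add deaths and patients for each treatment.
combined = {}
for treatment in ("standard", "new"):
    deaths = sum(arms[treatment][0] for arms in counts.values())
    total = sum(arms[treatment][1] for arms in counts.values())
    combined[treatment] = (deaths, total)

combined_rr = risk(*combined["standard"]) / risk(*combined["new"])
print(f"Combined: relative risk = {combined_rr:.2f}")
```

Within each hospital the ratio is above 1, yet the combined ratio falls below 1, because most of the new-treatment patients come from the hospital with the higher overall death rate.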


CASE STUDY 12.1 Assessing Discrimination in Hiring and Firing


The term relative risk is obviously not applicable to all types of data. It was developed
for use with medical data, where the risk of disease or injury is of concern. An
equivalent measure used in discussions of employment is the selection ratio, which 
is the ratio of the proportion of successful applicants for a job from one group (sex, 
race, and so on) compared with another group. For example, suppose a company 
hires 10% of the men who apply and 15% of the women. Then the selection ratio for 
women compared with men is 15/10 = 1.50. Comparing this with our discussion of 
relative risk, it says that women are 1.5 times as likely to be hired as men. The ratio 
is often used in the reverse direction when arguing that discrimination has occurred. 
For instance, in this case it might be argued that men are only 10/15 = 0.67 times 
as likely to be hired as women. Gastwirth (1988, p. 209) explains that government 
agencies in the United States have set a standard for determining whether there is po- 
tential discrimination in practices used for hiring. “If the minority pass (hire) rate is 
less than four-fifths (or 0.8) of the majority rate, then the practice is said to have a 
disparate or disproportionate impact on the minority group, and employers are re- 
quired to justify its job relevance.” In the case where 10% of men and 15% of women 
who apply are hired, the men would be the minority group. The selection ratio of 
men to women would be 10/15 = 0.67, so the hiring practice could be examined for 
potential discrimination. 
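
As a rough illustration of the four-fifths comparison described above, here is a minimal Python sketch. The function names and the default threshold argument are assumptions made for this example; the 10% and 15% hiring rates are the ones used in the text.

```python
def selection_ratio(rate_group, rate_comparison):
    """Hiring rate of one group divided by that of the comparison group."""
    return rate_group / rate_comparison


def below_four_fifths(ratio, threshold=0.8):
    """True when the ratio falls under the four-fifths (0.8) standard."""
    return ratio < threshold


# Hiring rates from the example in the text: 10% of men, 15% of women.
men_rate, women_rate = 0.10, 0.15

women_to_men = selection_ratio(women_rate, men_rate)
men_to_women = selection_ratio(men_rate, women_rate)

print(f"Women compared with men: {women_to_men:.2f}")
print(f"Men compared with women: {men_to_women:.2f}")
print("Below four-fifths standard:", below_four_fifths(men_to_women))
```

The direction of the comparison matters: the same two hiring rates give a ratio above 1 in one direction and below 0.8 in the other.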

Unfortunately, as Gastwirth and Greenhouse (1995) argue, this rule may not be 
as clear as it needs to be. They present a court case contesting the fairness of layoffs 
by the U.S. Labor Department, in which both sides in the case tried to interpret the 
rule in their favor. The layoffs were concentrated in Labor Department offices in the 
Chicago area, and the numbers are shown in Table 12.9. 

Table 12.9 Layoffs by Ethnic Group for Labor Department Employees

                          Laid Off?
Ethnic Group          Yes      No    Total    % Laid Off
African American      130    1382     1512           8.6
White                  87    2813     2900           3.0
Total                 217    4195     4412

Data Source: Gastwirth and Greenhouse, 1995.


If we consider the selection ratio based on people who were laid off, it should
be clear that the four-fifths rule was violated. The percentage of whites who were
laid off compared to the percentage of African Americans who were laid off is
3.0/8.6 = 0.35, clearly less than the 0.80 required for fairness. However, the defense
argued that the selection ratio should have been computed using those who were re- 
tained rather than those who were laid off. Because 91.4% of African Americans and 
97% of whites were retained, the selection ratio is 91.4/97 = 0.94, well above the 
ratio of 0.80 required to be within acceptable practice. As for which claim was sup- 
ported by the court, Gastwirth and Greenhouse (1995, p. 1642) report: “The lower 
court accepted the defendant’s claim [using the selection ratio for those retained] but 
the appellate opinion, by Judge Cudahy, remanded the case for reconsideration.” The 
judge also asked for further statistical information, to rule out chance as an explana- 
tion for the difference. 
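
The two competing selection ratios can be reproduced directly from the counts in Table 12.9. This is only a sketch, with dictionary and variable names chosen for illustration; it shows how the same table yields one ratio based on layoffs and a very different one based on retention.

```python
# Counts from Table 12.9.
laid_off = {"African American": 130, "White": 87}
retained = {"African American": 1382, "White": 2813}

# Proportion laid off and proportion retained in each group.
p_laid_off = {
    group: laid_off[group] / (laid_off[group] + retained[group])
    for group in laid_off
}
p_retained = {group: 1 - p for group, p in p_laid_off.items()}

# Plaintiffs' version: ratio of layoff rates (white to African American).
ratio_laid_off = p_laid_off["White"] / p_laid_off["African American"]

# Defense's version: ratio of retention rates (African American to white).
ratio_retained = p_retained["African American"] / p_retained["White"]

print(f"Selection ratio using layoffs:   {ratio_laid_off:.2f}")
print(f"Selection ratio using retention: {ratio_retained:.2f}")
```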

The issue of whether or not the result is statistically significant, meaning that 
chance is not a reasonable explanation for the difference, will be considered in Chap- 
ter 13. However, notice that we must be careful in its interpretation. The people in 
this study are not a random sample from a larger population; they are the only em- 
ployees of concern. Therefore, it may not make sense to talk about whether the re- 
sults represent a real difference in some hypothetical larger population. 

Gastwirth and Greenhouse point out that the discrepancy between the selection
ratio for those laid off versus those retained could have been avoided if the odds
ratio had been used instead of the selection ratio. The odds ratio compares the odds
of being laid off (rather than retained) for one group with the same odds for the
other group. Reversing the roles of laid off and retained simply inverts the odds
ratio, so both versions carry the same information. Therefore, the plaintiffs
and defendants could not manipulate the statistics to get two different answers.
The odds ratio for this example can be computed using the simple formula


\[
\text{odds ratio} = \frac{130 \times 2813}{1382 \times 87} = 3.04
\]


This number tells us that the odds of being laid off compared with being retained are 
three times higher for African Americans than for whites. Equivalently, the odds ra- 
tio in the other direction is 1/3.04, or about 0.33. It is this figure that should be as- 
sessed using the four-fifths rule. Gastwirth and Greenhouse argue that “should courts 
accept the odds ratio measure, they might wish to change the 80% rule to about 67% 
or 70% since some cases that we studied and classified as close, had ORs [odds ra- 
tios] in that neighborhood” (1995, p. 1642). 
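
To see why the odds ratio cannot be turned into two conflicting answers, the calculation can be repeated with the roles of "laid off" and "retained" swapped. The sketch below is illustrative only (the function name is an assumption); it uses the counts from Table 12.9.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2 x 2 table with rows (a, b) and (c, d)."""
    return (a * d) / (b * c)


# Rows: African American, white. Columns: laid off, retained.
or_laid_off = odds_ratio(130, 1382, 87, 2813)

# Same rows, but with the columns swapped: retained, laid off.
or_retained = odds_ratio(1382, 130, 2813, 87)

print(f"Odds ratio, laid off vs. retained: {or_laid_off:.2f}")
print(f"Odds ratio, retained vs. laid off: {or_retained:.2f}")
print(f"Product of the two: {or_laid_off * or_retained:.2f}")
```

Swapping the outcome categories only inverts the odds ratio, so the two versions are reciprocals of each other, unlike the two selection ratios.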

Finally, Gastwirth and Greenhouse argue that the selection ratio may still be appropriate
in some cases. For example, some employment practices require applicants
to meet certain requirements (such as having a high school diploma) before they can
be considered for a job. If 98% of the majority group meets the criteria but only 96%
of the minority group does, then the odds of meeting the criteria versus not meeting
them would only be about half as high for the minority as for the majority. To see
this, consider a hypothetical set of 100 people from each group, with 98 of the 100
in the majority group and 96 of the 100 in the minority group meeting the criteria.
The odds ratio for meeting the criteria to not meeting them for the minority compared
to the majority group would be (96 × 2)/(4 × 98) = 0.49, so even if all qualified
candidates were hired, the odds of being hired for the minority group would
only be about half of those for the majority group. But the selection ratio would be
96/98 = 98%, which should certainly be legally acceptable. As always, statistics
must be combined with common sense to be useful.


For Those Who Like Formulas 


An r × c contingency table is one with r categories for the row variable and c categories
for the column variable. To represent the observed numbers in a 2 × 2 contingency
table, we use the notation


                    Variable 2
Variable 1      Yes      No       Total
Yes             a        b        a + b
No              c        d        c + d
Total           a + c    b + d    n


Relative Risk and Odds Ratio 


Using the notation for the observed numbers, if variable 1 is the explanatory variable
and variable 2 is the response variable, then we can compute

\[
\text{relative risk} = \frac{a(c + d)}{c(a + b)}
\]

\[
\text{odds ratio} = \frac{a \times d}{b \times c}
\]
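
The shortcut expressions in terms of a, b, c, and d can be compared with the step-by-step definitions (risk in row 1 divided by risk in row 2, and odds in row 1 divided by odds in row 2). The following Python sketch does this for Hospital B from Table 12.6; the function names and the choice of "Yes" = died are assumptions made for illustration.

```python
def relative_risk(a, b, c, d):
    """Shortcut form: a(c + d) / (c(a + b))."""
    return a * (c + d) / (c * (a + b))


def odds_ratio(a, b, c, d):
    """Shortcut form: (a * d) / (b * c)."""
    return (a * d) / (b * c)


# Hospital B from Table 12.6, with "Yes" = died and "No" = survived.
# Row 1 is the standard treatment, row 2 is the new treatment.
a, b = 500, 500
c, d = 5, 95

# Step-by-step definitions, for comparison with the shortcuts.
risk_row1 = a / (a + b)
risk_row2 = c / (c + d)
odds_row1 = a / b
odds_row2 = c / d

print(f"relative risk: {relative_risk(a, b, c, d):.1f}")
print(f"odds ratio:    {odds_ratio(a, b, c, d):.1f}")
```

Because the risks in this table are far from small, the odds ratio is noticeably larger than the relative risk; the two measures are close only when both risks are small.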


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. Suppose a study on the relationship between gender and political party included 
200 men and 200 women and found 180 Democrats and 220 Republicans. Is that 
information sufficient for you to construct a contingency table for the study? If 
so, construct the table. If not, explain why not. 


2. According to the World Almanac and Book of Facts (1995, p. 964), the rate
of deaths by drowning in the United States in 1993 was 1.6 per 100,000
population. Express this statistic as a percentage of the population; then explain
why it is better expressed as a rate than as a percentage.


3. According to the University of California at Berkeley Wellness Letter (February
1994, p. 1), only 40% of all surgical operations require an overnight stay at a
hospital. Rewrite this fact as a proportion, as a risk, and as the odds of an
overnight stay. In each case, express the result as a full sentence.


*4. Science News (25 February 1995, p. 124) reported a study of 232 people, aged
55 or over, who had heart surgery. The patients were asked whether their religious
beliefs give them feelings of strength and comfort and whether they regularly
participate in social activities. Of those who said yes to both, about 1 in
50 died within 6 months after their operation. Of those who said no to both,
about 1 in 5 died within 6 months after their operation. What is the relative risk
of death (within 6 months) for the two groups? Write your answer in a sentence
or two that would be understood by someone with no training in statistics.


5. Raloff (1995) reported on a study conducted by Dimitrios Trichopoulos of the
Harvard School of Public Health in which researchers “compared the diets of
820 Greek women with breast cancer and 1548 others admitted to Athens-area
hospitals for reasons other than cancer.” One of the results had to do with
consumption of olive oil, a staple in many Greek diets. The article reported that
“women who eat olive oil only once a day face a 25 percent higher risk of breast
cancer than women who consume it twice or more daily.”


a. The increased risk of breast cancer for those who consume olive oil only once
a day is 25%. What is the relative risk of breast cancer for those who consume
olive oil only once a day, compared to those who eat it twice or more?


b. What information is missing from this article that would help individuals
assess the importance of the result in their own lives?


*6. The headline in an article in the Sacramento Bee read “Firing someone? Risk of
heart attack doubles” (Haney, 1998). The article explained that “between 1989 
and 1994, doctors interviewed 791 working people who had just undergone 
heart attacks about what they had done recently. The researchers concluded that 
firing someone or having a high-stakes deadline doubled the usual risk of a heart 
attack during the following week. . . . For a healthy 50-year-old man or a healthy 
60-year-old woman, the risk of a heart attack in any given hour without any trigger
is about 1 in a million.”


*a. Refer to Chapter 5. What type of study is this? 


b. Refer to the reasons for relationships listed in Section 11.3. Which do you 
think is the most likely explanation for the relationship found between firing 
someone and having a heart attack? Do you think the headline for this arti- 
cle was appropriate? Explain. 

c. Assuming the relationship is indeed as stated in the article, write sentences 
that could be understood by someone with no training in statistics, giving 
each of the following for this example: 

i. The odds ratio
ii. Increased risk
iii. Relative risk


7. In a test for extrasensory perception (ESP), described in Case Study 13.1 in the
next chapter, people were asked to try to use psychic abilities to describe a hid- 
den photo or video segment being viewed by a “sender.” They were then shown 
four choices and asked which one they thought was the real answer, based on 
what they had described. By chance alone, 25% of the guesses would be ex- 
pected to be successful. The researchers tested 354 people and 122 (about 
34.5%) of the guesses were successful. In both parts a and b, express your an- 
swer in a full sentence. 


a. What are the odds of a successful guess by chance alone? 
b. What were the odds of a successful guess in the experiment? 


8. A newspaper story released by the Associated Press noted that “a study by the 
Bureau of Justice Statistics shows that a motorist has about the same chance of 
being a carjacking victim as being killed in a traffic accident, 1 in 5000” [Davis 
(CA) Enterprise, 3 April 1994, p. A9]. Discuss this statement with regard to 
your own chances of each event happening to you. 


*9. The Roper Organization (1992) conducted a study as part of a larger survey to
ascertain the number of American adults who had experienced phenomena such 
as seeing a ghost, “feeling as if you left your body,” and seeing a UFO. A rep- 
resentative sample of adults (18 and over) in the continental United States were 
interviewed in their homes during July, August, and September 1991. The 
results when respondents were asked about seeing a ghost are shown in 
Table 12.10. 


*a. Find numbers for each of the following:
i. The percentage of the younger group who reported seeing a ghost
ii. The proportion of the older group who reported seeing a ghost
iii. The risk of reportedly seeing a ghost in the younger group
iv. The odds of reportedly seeing a ghost to not seeing one in the older
group
b. What is the relative risk of reportedly seeing a ghost for one group compared 
to the other? Write your answer in the form of a sentence that could be un- 
derstood by someone who knows nothing about statistics. 


c. Repeat part b using increased risk instead of relative risk. 


Table 12.10 Age and Ghost Sightings

                     Reportedly Has Seen a Ghost
                       Yes       No     Total
Aged 18 to 29          212     1313      1525
Aged 30 or over        465     3912      4377
Total                  677     5225      5902


Data Source: The Roper Organization, 1992, p. 35.


*10. Using the terminology of this chapter, what name (for example, odds, risk, relative
risk) applies to each of the boldface numbers in the following quotes?


a. “Fontham found increased risks of lung cancer with increasing exposure to
secondhand smoke, whether it took place at home, at work, or in a social setting.
A spouse’s smoking alone produced an overall 30 percent increase in
lung-cancer risk” (Consumer Reports, January 1995, p. 28).


b. “What they found was that women who smoked had a risk [of getting lung
cancer] 27.9 times as great as nonsmoking women; in contrast, the risk for
men who smoked regularly was only 9.6 times greater than that for male
nonsmokers” (Taubes, 26 November 1993, p. 1375).


*c. “One student in five reports abandoning safe-sex practices when drunk”
(Newsweek, 19 December 1994, p. 73).


11. A statement quoted in this chapter was “clinically depressed people are at a 50
percent greater risk of killing themselves” (Newsweek, 18 April 1994, p. 48).
This means that when comparing people who are clinically depressed to those
who are not, the former have an increased risk of killing themselves of 50%.
What is the relative risk of suicide for those who are clinically depressed compared
with those who are not?


12. According to Consumer Reports (1995 January, p. 29), “among nonsmokers
who are exposed to their spouses’ smoke, the chance of death from heart disease
increases by about 30%.” Rewrite this statement in terms of relative risk, using
language that would be understood by someone who does not know anything
about statistics.


*13. Reporting on a study of drinking and drug use among college students in the
United States, a Newsweek reporter wrote:


Why should college students be so impervious to the lesson of the morning 
after? Efforts to discourage them from using drugs actually did work. The 
proportion of college students who smoked marijuana at least once in 30 
days went from one in three in 1980 to one in seven last year [1993]; co- 
caine users dropped from 7 percent to 0.7 percent over the same period. 


(19 December 1994, p. 72) 


a. What was the relative risk of cocaine use for college students in 1980 com- 
pared with college students in 1993? Write your answer as a statement that 
could be understood by someone who does not know anything about statistics. 


b. Are the figures given for marijuana use (for example, “one in three”) pre- 
sented as proportions or as odds? Whichever they are, rewrite them as the 
other. 

*c. Do you agree with the statement that “efforts to discourage them from using
drugs actually did work”? Explain your reasoning. 


14. A well-known example of Simpson’s Paradox, published by Bickel, Hammel,
and O’Connell (1975), examined admission rates for men and women who had
applied to graduate programs at the University of California at Berkeley. The actual
breakdown of data for specific programs is confidential, but the point can
be made with similar, hypothetical numbers. For simplicity, we will assume
there are only two graduate programs. The figures for acceptance to each program
are shown in Table 12.11.


Table 12.11 An Example of Simpson’s Paradox

                 Program A                Program B
          Admit   Deny   Total     Admit   Deny   Total
Men         400    250     650        50    300     350
Women        50     25      75       125    300     425
Total       450    275     725       175    600     775


a. Combine the data for the two programs into one aggregate table. What percentage
of all men who applied were admitted? What percentage of all
women who applied were admitted? Which sex was more successful?


b. What percentage of the men who applied did Program A admit? What percentage
of the women who applied did Program A admit? Repeat the question
for Program B. Which sex was more successful in getting admitted to
Program A? Program B?


c. Explain how this problem is an example of Simpson’s Paradox. Provide a
potential explanation for the observed figures by guessing what type of programs
A and B might have been. (Hint: Which program was easier to get admitted
to overall? Which program interested men more? Which program
interested women more?)


15. A case-control study in Berlin, reported by Kohlmeier, Arminger, Bartolomey- 
cik, Bellach, Rehm, and Thamm (1992) and by Hand et al. (1994), asked 239 
lung cancer patients and 429 controls (matched to the cases by age and sex) 
whether they had kept a pet bird during adulthood. Of the 239 lung cancer cases, 
98 said yes. Of the 429 controls, 101 said yes. 


a. Construct a contingency table for the data.

b. Compute the risk of lung cancer for bird and non-bird owners for this study.

c. Can the risks of lung cancer for the two groups, computed in part b, be used
as baseline risks for the populations of bird and non-bird owners? Explain.

d. How much more likely is lung cancer for bird owners than for non-bird owners
in this study; that is, what is the increased risk?

e. What information about risk would you want, in addition to the information
on increased risk in part d of this problem, before you made a decision about
whether to own a pet bird?


16. The data in Table 12.12 are reproduced from Case Study 12.1 and represent em- 
ployees laid off by the U.S. Department of Labor. 


Table 12.12 Layoffs by Ethnic Group for Labor Department Employees

                          Laid Off?
Ethnic Group          Yes      No    Total    % Laid Off
African American      130    1382     1512           8.6
White                  87    2813     2900           3.0
Total                 217    4195     4412

Data Source: Gastwirth and Greenhouse, 1995.


a. Compute the odds of being retained to being laid off for each ethnic group.


b. Use your results in part a to compute the odds ratio and confirm that it is 
about 3.0, as computed in Case Study 12.1 (where the shortcut method was 
used). 

*17. Kohler (1994, p. 427) reports data on the approval rates and ethnicity for mort- 
gage applicants in Los Angeles in 1990. Of the 4096 African American appli- 
cants, 3117 were approved. Of the 84,947 white applicants, 71,950 were 
approved. 

a. Construct a contingency table for the data. 

*b. Compute the proportion of each ethnic group that was approved for a 
mortgage. 

c. Compute the ratio of the two proportions you found in part b. Would that ra- 
tio be more appropriately called a relative risk or a selection ratio? Explain. 

d. Would the data pass the four-fifths rule used in employment and described in 
Case Study 12.1? Explain. 

18. Read News Story 13 in the Appendix and on the CD, “3 factors key for drug use
in kids.” Identify or calculate a numerical value for each of the following from 
the information in the news story: 

a. The increased risk of smoking, drinking, getting drunk, or using illegal drugs 
for teens who are frequently bored, compared with those who are not. 

b. The relative risk of smoking, drinking, getting drunk, or using illegal drugs 
for teens who are frequently bored, compared with those who are not. 

c. The proportion of teens in the survey who said they have no friends who reg- 
ularly drink. 


d. The percent of teens in the survey who do have friends who use marijuana. 




*19. Read News Story 13 in the Appendix and on the CD, “3 factors key for drug use 
in kids.” Identify the statistical term for the number(s) in bold in each of the fol- 
lowing quotes (for example, relative risk). 

*a. “And kids with $25 or more a week in spending money are nearly twice as
likely to smoke, drink or use drugs as children with less money.”

b. “High stress was experienced more among girls than boys, with nearly one
in three saying they were highly stressed compared with fewer than one in
four boys.”

c. “Kids at schools with more than 1,200 students are twice as likely as those
attending schools with fewer than 800 students to be at high risk for substance
abuse.”

d. “Children ages 12 to 17 who are frequently bored are 50 percent more likely
to smoke, drink, get drunk or use illegal drugs.”


20. Refer to News Story 10 in the Appendix and on the CD, “Churchgoers live
longer, study finds.” One of the statements in the news story is “women who attend
religious services regularly are about 80 percent as likely to die as those not
regularly attending.” Discuss the extent to which each of the three “common
ways the media misrepresent statistics about risk” from Section 12.3, listed as
parts a–c, applies to this quote.


a. The baseline risk is missing.
b. The time period of the risk is not identified.
c. The reported risk is not necessarily your risk.


*21. Refer to the Additional News Source accompanying News Story 10 on the CD,
“ ‘Keeping the faith’ UC Berkeley researchers links weekly church attendance to
longer, healthier life.” Based on the information in the story, identify or calculate
a numerical value for each of the following:


*a. The increased risk of dying from circulatory diseases for people who attended
religious services less than once a week or never, compared to those
who attended at least weekly.


b. The relative risk of dying from circulatory diseases for people who attended
religious services less than once a week or never, compared to those who attended
at least weekly.


c. The increased risk of dying from digestive diseases for people who attended
religious services less than once a week or never, compared to those who attended
at least weekly.


d. The relative risk of dying from digestive diseases for people who attended
religious services less than once a week or never, compared to those who attended
at least weekly.


22. Refer to Original Source 10, “Religious attendance and cause of death over 31
years” on the CD. For this study, the researchers used a complicated statistical 
method to assess relative risk by adjusting for factors such as education and in- 
come. The resulting numbers are called “relative hazards” instead of relative 
risks (abbreviated RH), but have the same interpretation as relative risk. Refer 
to the relative hazards (RH) in Table 4 of the article. Write a sentence or two that 
someone with no training in statistics would understand, presenting each of the 
following for those who do not attend weekly religious services compared with 
those who do: 


a. The relative risk of dying from all causes for women under age 70 
b. The increased risk of dying from all causes for men under age 70 

c. The relative risk of dying from cancer for men age 70+ 
d. The relative risk of dying from cancer for men under age 70


Mini-Projects


1. Carefully collect data cross-classified by two categorical variables for which 
you are interested in determining whether there is a relationship. Do not get the 
data from a book or journal; collect it yourself. Be sure to get counts of at least 
5 in each cell and be sure the individuals you use are not related to each other 
in ways that would influence their data. 


a. Create a contingency table for the data. 


b. Compute and discuss the risks and relative risks. Are those terms appropri- 
ate for your situation? Explain. 


c. Write a summary of your findings, including whether a cause-and-effect 
conclusion could be made if you observed a relationship. 


2. Find a news story that discusses a study showing increased (or decreased) risk 
of one variable based on another. Write a report evaluating the information given 
in the article and discussing what conclusions you would reach based on the in- 
formation in the article. Discuss whether any of the features in Section 12.3, 
“Misleading Statistics about Risk,” apply to the situation. 


3. Refer to News Story 12, “Working nights may increase breast cancer risk” in the 
Appendix and on the CD and the accompanying three journal articles on the CD. 
Write a two- to four-page report describing how the studies were done and what 
they found in terms of relative risks and odds ratios. (Note that complicated sta- 
tistical methods were used that adjusted for things like reproductive history, but 
you should still be able to interpret the odds ratios and relative risks reported in 
the articles.) Discuss shortcomings you think might apply to the results, if any.


Statistical Significance for 2 × 2 Tables


Thought Questions 
1. Suppose that a sample of 400 people included 100 under age 30 and 300 aged 30
and over. Each person was asked whether or not they supported requiring public
school children to wear uniforms. Fill in the number of people who would be expected
to fall into the cells in the following table, if there is no relationship between
age and opinion on this question. Explain your reasoning. (Hint: Notice that overall,
30% favor uniforms.)


               Yes, Favor Uniforms    No, Don’t Favor Uniforms    Total
Under 30                                                            100
30 and over                                                         300
Total                          120                         280      400


2. Suppose that in a random sample of 10 males and 10 females, 7 of the males (70%)
and 4 of the females (40%) admitted that they have fallen asleep at least once while
driving. Would these numbers convince you that there is a difference in the proportions
of males and females in the population who have fallen asleep while driving?
Now suppose the sample consisted of 1000 of each sex but the proportions remained
the same, with 700 males and 400 females admitting that they had fallen asleep at
least once while driving. Would these numbers convince you that there is a difference
in the population proportions who have fallen asleep? Explain the difference in the
two scenarios.

3. Based on the data from Example 1 in Chapter 12, we can conclude that there is a
statistically significant relationship between taking aspirin or not and having a heart
attack or not. What do you think it means to say that the relationship is “statistically 
significant”? 

4. Refer to the previous question. Do you think that a statistically significant relationship 
is the same thing as an important and sizeable relationship? Explain. 


13.1 Measuring the Strength of the Relationship 


The Meaning of Statistical Significance 


The purpose of this chapter is to help you understand what researchers mean when 
they say that a relationship between two categorical variables is statistically signifi- 
cant. In plain language, it means that a relationship the researchers observed in a 
sample was unlikely to have occurred unless there really is a relationship in the pop- 
ulation. In other words, the relationship in the sample is probably not just a statisti- 
cal fluke. However, it does not mean that the relationship between the two variables 
is significant in the common English definition of the word. The relationship in the 
population may be real, but so small as to be of little practical importance. 

Suppose researchers want to know if there is a relationship between two cate- 
gorical variables. One example in Chapter 12 is whether there is a relationship be- 
tween taking aspirin (or not) and having a heart attack (or not). Another example is 
whether there is a relationship between smoking (or not) and getting pregnant easily 
when trying. In most cases, it would be impossible to measure the two variables on 
everyone in the population. So, researchers measure the two categorical variables on 
a sample of individuals from a population, and they are interested in whether or not 
there is a relationship between the two variables in the population. It is easy to see 
whether or not there is a relationship in the sample. In fact, there almost always is. 
The percentage responding in a particular way is unlikely to be exactly the same for 
all categories of an explanatory variable. Researchers are interested in assessing 
whether the differences in observed percentages in the sample are just chance dif- 
ferences, or if they represent a real difference for the population. If a relationship as 
strong as the one observed in the sample (or stronger) would be unlikely without a 
real relationship in the population, then the relationship in the sample is said to be 
statistically significant. The notion that it could have happened just by chance is 
deemed to be implausible. 


Measuring the Relationship 
in a 2 X 2 Contingency Table 


We discussed the concept of statistical significance briefly in Chapter 10. Recall 
that the term can be applied if an observed relationship in a sample is stronger than 
what would be expected by chance if there were no relationship in the population. 
Specifically, we required such a relationship to be larger than 95% of those that 
would be observed just by chance. Let’s see how that rule can be applied to 
relationships between categorical variables. 

We will consider only the simplest case, that of 2 X 2 contingency tables. In 
other words, we will consider only the situation where each of two variables has two 
categories. The same principles and interpretation apply to tables of two variables 
with more than two categories each, but the details are more cumbersome. 

In Chapter 12 we saw that relative risk and odds ratios were both useful for mea- 
suring the relationship between outcomes in a 2 X 2 table. Another way to measure 
the strength of the relationship is by the difference in the percentages of outcomes 
for the two categories of the explanatory variable. In many cases, this measure will 
be easier to interpret than the relative risk or odds ratio, and it provides a very gen- 
eral method for measuring the strength of the relationship between the two variables. 
However, before we assess statistical significance, we need to incorporate informa- 
tion about the size of the sample as well. Our examples will illustrate why this fea- 
ture is necessary. Let’s revisit some examples from Chapter 12 and use them to 
illustrate why the size of the sample is important. 


EXAMPLE 1 Aspirin and Heart Attacks 


In Case Study 1.2 and Example 1 in Chapter 12 we learned about an experiment in 
which physicians were randomly assigned to take aspirin or a placebo. They were ob- 
served for 5 years and the response variable for each man was whether or not he had a 
heart attack. As shown in Table 12.1 on page 219, 104 of the 11,037 aspirin takers had 
heart attacks, whereas 189 of the 11,034 placebo takers had them. Notice that the 
difference in percentage of heart attacks between aspirin and placebo takers is only 
1.71% − 0.94% = 0.77%, less than 1%. Based on this small difference in percents, 
can we be convinced by the data in this sample that there is a real relationship in the 
population between taking aspirin and risk of heart attack? Or would 293 men have had 
heart attacks anyway, and slightly more of them just happened to be assigned to the 
placebo group? Assessing whether or not the relationship is statistically significant will 
allow us to answer that question. 


EXAMPLE 2 Young Drivers, Gender, and Driving 
Under the Influence of Alcohol 


In Case Study 6.3 and Example 2 in Chapter 12, data were presented for a roadside sur- 
vey of young drivers. Of the 481 males in the survey, 77, or 16%, had been drinking in 
the past 2 hours. Of the 138 females in the survey, 16 of them, or 11.6%, had been 
drinking. Notice that the difference between males and females who had been drinking 
is 16% − 11.6% = 4.4%. Is this difference large enough to provide convincing evidence 
that there is a difference in the percent of young males and females in the population 
who drink and drive? If in fact the population percents are equal, how likely 
would we be to observe a sample with a difference as large as 4.4% or larger? We will 
determine the answer to that question later in this chapter. 


EXAMPLE 3 Ease of Pregnancy for Smokers and Nonsmokers 

In Example 3 of Chapter 12, the explanatory variable is whether or not a woman smoked 
while trying to get pregnant and the response variable is whether she was able to 
achieve pregnancy during the first cycle of trying. The difference between the percentage 
of nonsmokers and smokers who achieved pregnancy during the first cycle is 
41% − 29% = 12%. There were 486 nonsmokers and 100 smokers in the study. Is a 
difference of 12% large enough to rule out chance, or could it be that this particular 
sample just happened to have more smokers in the group that had trouble getting pregnant? 
For the population of all smokers and nonsmokers who are trying to get pregnant, 
is it really true that smokers are less likely to get pregnant in the first cycle? We will 
determine the answer to that question in this chapter. 


Strength of the Relationship versus Size of the Study 


Can we conclude that the relationships observed for the samples in these examples 
also hold for the populations from which the samples were drawn? The difference of 
less than 1% (0.77%) in heart attack rates between aspirin and placebo takers seems 
rather small, and in fact if it had occurred in a study with only a few hundred men, 
it would probably not be convincing. But the experiment included over 22,000 men, 
so perhaps such a small difference should convince us that aspirin really does work 
for the population of all men represented by those in the study. 

The difference of 4.4% between male and female drinkers is also rather small 
and was not convincing to the Supreme Court. The difference of 12% in Example 3 
is much larger, but is it large enough to be convincing based on fewer than 600 
women? Perhaps another study, on a different 600 women, would yield exactly the 
opposite result. 

At this point, you should be able to see that whether we can rule out chance de- 
pends on both the strength of the relationship and on how many people were in- 
volved in the study. An observed relationship is much more believable if it is based 
on 22,000 people (as in Example 1) than if it is based on about 600 people (as in Ex- 
amples 2 and 3). 
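
As a rough illustration, the following Python sketch recomputes the difference in percentages and the total sample size for each of the three examples. The function name `percent_difference` is ours, and the Example 3 counts (29 of 100 smokers and 198 of 486 nonsmokers) are taken from the table of counts given later in the chapter, so the unrounded difference comes out slightly below the 12% obtained from the rounded percentages.

```python
def percent_difference(count_a, total_a, count_b, total_b):
    """Return (percent in group A, percent in group B, A minus B)."""
    pct_a = 100 * count_a / total_a
    pct_b = 100 * count_b / total_b
    return pct_a, pct_b, pct_a - pct_b

# (label, count in first group, size of first group,
#  count in second group, size of second group)
studies = [
    ("Example 1: placebo vs. aspirin, heart attacks", 189, 11034, 104, 11037),
    ("Example 2: males vs. females, had been drinking", 77, 481, 16, 138),
    ("Example 3: nonsmokers vs. smokers, pregnant in first cycle", 198, 486, 29, 100),
]

for label, a, n_a, b, n_b in studies:
    pct_a, pct_b, diff = percent_difference(a, n_a, b, n_b)
    print(f"{label}: {pct_a:.2f}% - {pct_b:.2f}% = {diff:.2f} points, n = {n_a + n_b}")
```

The printout puts the two ingredients side by side: the size of the difference and the number of people it is based on.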


13.2 Steps for Assessing Statistical Significance 


In Chapter 22 we will learn about assessing statistical significance for a variety of 
situations using a method called hypothesis testing. There are four basic steps re- 
quired for any situation in which this method is used. 


The basic steps for hypothesis testing are 

1. Determine the null hypothesis and the alternative hypothesis. 

2. Collect the data and summarize them with a single number called a test 
statistic. 

3. Determine how unlikely the test statistic would be if the null hypothesis 
were true. 

4. Make a decision. 


In this chapter we discuss how to carry out these steps when the question of 
interest is whether there is a relationship between two categorical variables. 


Step 1: Determine the null hypothesis and the alternative hypothesis. 

In general, the alternative hypothesis is what the researchers are interested in 
showing to be true, so it is sometimes called the research hypothesis. The null hy- 
pothesis is usually some form of “nothing happening.” In the situation in this chap- 
ter, in which the question of interest is whether two categorical variables are related, 
the hypotheses can be stated in the following general form: 


Null hypothesis: There is no relationship between the two variables in the 
population. 


Alternative hypothesis: There is a relationship between the two variables in the 
population. 


In specific situations, these hypotheses are worded to fit the context. But in general, 
the null hypothesis is that there is no relationship and the alternative hypothesis is 
that there is a relationship. 

Remember that a cause-and-effect conclusion cannot be made unless the data are 
from a randomized experiment. Notice that the hypotheses are stated in terms of 
whether or not there is a relationship and do not mention that one variable may cause 
a change in the other variable. When the data are from a randomized experiment, of- 
ten the alternative hypothesis can be interpreted to mean that the explanatory vari- 
able caused a change in the response variable. 

Here are the hypotheses for the three examples from the previous section: 


EXAMPLE 1 CONTINUED Aspirin and Heart Attacks 


The participants in this experiment were all male physicians. The population to which the 
results apply depends on what larger group they are likely to represent. That isn’t a sta- 
tistical question; it's one of common sense. It may be all males in white-collar occupa- 
tions, or all males with somewhat active jobs. You need to decide for yourself what 
population you think is represented. We will state the hypotheses by simply using the 
word population without reference to what it is. 


Null hypothesis: There is no relationship between taking aspirin and risk of heart 
attack in the population. 


Alternative hypothesis: There is a relationship between taking aspirin and risk of 
heart attack in the population. 


Because the data in this case came from a randomized experiment, if the alternative 
hypothesis is the one chosen, it is reasonable to state in the conclusion that aspirin 
actually causes a change in the risk of heart attack in the population. 


EXAMPLE 2 CONTINUED Drinking and Driving 

For this example, the sample was drawn from young drivers (under 20 years of age) in 
the Oklahoma City area in August of 1972 and 1973. Again, the population is defined 
as the larger group represented by these drivers. We could consider that to be only 
young drivers in that area at that time period or young drivers in general. 


Null hypothesis: Males and females in the population are equally likely to drive 
within 2 hours of drinking alcohol. 


Alternative hypothesis: Males and females in the population are not equally likely 
to drive within 2 hours of drinking alcohol. 


Notice that the alternative hypothesis does not specify whether males or females are 
more likely to drink and drive; it simply states that they are not equally likely to 
do so. 


EXAMPLE 3 CONTINUED Smoking and Pregnancy 


Null hypothesis: Smokers and nonsmokers are equally likely to get pregnant during 
the first cycle in the population of women trying to get pregnant. 


Alternative hypothesis: Smokers and nonsmokers are not equally likely to get preg- 
nant during the first cycle in the population of women trying to get pregnant. 


As in the previous two examples, notice that the alternative hypothesis does not specify 
which group is likely to get pregnant more easily. It simply states that there is a differ- 
ence. In later chapters we will learn how to test for a difference of a particular type. 
The data for this example were obviously not based on a randomized experiment; it 
would be unethical to randomly assign women to smoke or not. Therefore, even if the 
data allow us to conclude that there is a relationship between smoking and time to preg- 
nancy, it isn't appropriate to conclude that smoking causes a change in ease of getting 
pregnant. There could be confounding variables, like diet or alcohol consumption, that 
are related to smoking behavior and also to ease of getting pregnant. 


13.3 The Chi-Square Test 


The hypotheses are established before collecting any data. In fact, it is not accept- 
able to examine the data before specifying the hypotheses. One sacred rule in statis- 
tics is that it is not acceptable to use the same data to determine and test hypotheses. 
That would be cheating, because one could collect data for a variety of potential re- 
lationships, then test only those for which the data appear to show a statistically sig- 
nificant difference. 

However, the remaining steps are carried out after the data have been collected. 
In the scenario of this chapter, in which we are trying to determine if there is a rela- 
tionship between two categorical variables, the procedure is called a chi-square test. 
Steps 2 through 4 proceed as follows: 


Step 2: Collect the data and summarize them with a single number called a test statistic. 

The “test statistic” in this case is called a chi-square statistic. It compares the 
data in the sample to what would be expected if there were no relationship between 
the two variables. Details are presented after a summary of the remaining steps. 


Step 3: Determine how unlikely the test statistic would be if the null hypothesis 
were true. 

This step is the same for any hypothesis test, and the resulting number is called 
the p-value because it’s a probability. (We will learn the technical definition of 
probability in Chapter 16.) Specifically, the p-value is the probability of observing a 
test statistic as extreme as the one observed or more so if the null hypothesis is re- 
ally true. In the scenario in this chapter, extreme simply means “large.” So, the 
p-value is the probability that the chi-square statistic found in step 2 would be as 
large as it is or larger if in fact the two variables are not related in the population. 
The details of computing this probability are beyond the level of this book, but it is 
simple to find using computer software such as Microsoft Excel. 


Step 4: Make a decision. 

In general, results are said to be statistically significant if the p-value is 0.05 
(5%) or less. This is an arbitrary criterion, but it is well established in almost all ar- 
eas of research. Occasionally, researchers will require the p-value to be 0.01 (1%) or 
less, but if that’s the case it will be stated explicitly. This criterion basically says that 
the sample results would be implausible unless there really is a relationship in the 
population. So, we conclude that there is a relationship in the population if the 
p-value is 0.05 or less. For the scenario in this chapter, the results are statistically 
significant if the chi-square statistic is 3.84 or larger. That criterion is equivalent to a 
p-value of 0.05 or less. In other words, if the two variables are not related in the 
population, then the chi-square statistic will be 3.84 or larger only 5% of the time just by 
chance. Since we now know that the chi-square statistic won’t often be that large 
(3.84 or larger) if the two variables are not related, researchers conclude that if it is 
that large, then the two variables probably are related. 


Here is a summary of the decision that’s made for a 2 X 2 contingency table: 


■ If the chi-square statistic is at least 3.84, the p-value is 0.05 or less, so conclude that 
the relationship in the population is real. Equivalent ways to state this result are 

    The relationship is statistically significant. 
    Reject the null hypothesis (that there is no relationship in the population). 
    Accept the alternative hypothesis (that there is a relationship in the population). 


■ If the chi-square statistic is less than 3.84, the p-value is greater than 0.05, so there 
isn’t enough evidence to conclude that the relationship in the population is real. 
Equivalent ways to state this result are 

    The relationship is not statistically significant. 
    Do not reject the null hypothesis (that there is no relationship in the population). 
    The relationship in the sample could have occurred by chance. 


Notice that we do not accept the null hypothesis; we simply conclude that the evi- 
dence isn’t strong enough to reject it. The reason for this will become clear later. 
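
The decision rule for a 2 X 2 table is simple enough to write as a few lines of Python. This is only a sketch of the rule stated above; the names `CRITICAL_VALUE_2X2` and `decide` are ours, and the two statistics passed in are the values reported later in the chapter for Examples 3 and 4.

```python
# Cutoff quoted in the text for a 2 x 2 table at the 0.05 level
CRITICAL_VALUE_2X2 = 3.84

def decide(chi_square):
    """Return the wording of the decision for a 2 x 2 table."""
    if chi_square >= CRITICAL_VALUE_2X2:
        return "statistically significant: reject the null hypothesis"
    # Note the wording: we never "accept" the null hypothesis.
    return "not statistically significant: do not reject the null hypothesis"

print(decide(4.82))  # chi-square statistic reported for Example 3
print(decide(1.75))  # chi-square statistic reported for Example 4
```

The second branch deliberately says "do not reject" rather than "accept," mirroring the wording in the summary above.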


Computing the Chi-Square Statistic 


To assess whether a relationship in a 2 X 2 table achieves statistical significance, we 
need to know the value of the chi-square statistic for the table. This statistic is a 
measure that combines the strength of the relationship with information about the 
size of the sample to give one summary number. If that summary number is larger 
than 3.84, the relationship in the table is considered to be statistically significant. 


Table 13.1 Time to Pregnancy for Smokers and Nonsmokers 

                  Pregnancy Occurred After 
             First Cycle   Two or More Cycles   Total   Percentage in First Cycle 
Smoker            29               71            100              29% 
Nonsmoker        198              288            486              41% 
Total            227              359            586              38.7% 

The actual computation and assessment of statistical significance is tedious but 
not difficult. There are different ways to represent the necessary formula, some of 
them useful only for 2 X 2 tables. Here we present only one method, but this method 
can be used for tables with any number of rows and columns. As we list the neces- 
sary steps, we will demonstrate the computation using the data from Example 3, 
shown in Table 12.4 and again in Table 13.1. 


Computing a chi-square statistic: 

1. Compute the expected counts, assuming the null hypothesis is true. 
2. Compare the observed and expected counts. 

3. Compute the chi-square statistic. 


Note: This method is valid only if there are no empty cells in the table and if 
all expected counts are at least 5. 


Compute the Expected Counts, 
Assuming the Null Hypothesis Is True 

Compute the number of individuals that would be expected to fall into each of the 
cells of the table if there were no relationship. The formula for finding the expected 
count in any row and column combination is 


expected count = (row total)(column total)/(table total) 


The expected counts in each row and column must sum to the same totals as the ob- 
served numbers, so for a 2 X 2 table we need only compute one of these using the 
formula. We can obtain the rest by subtraction. 

It is easy to see why this formula would give the number to be expected if there 
were no relationship. Consider the first column. The proportion who fall into the first 
column overall is (column 1 total)/(table total). For instance, in Table 13.1, 227/586 
or 38.7% of the women got pregnant in the first cycle. If there is no relationship 
between the two variables, then that proportion should be the same for both rows. In 
the example, we would expect the same proportion of smokers and nonsmokers to 
get pregnant in the first cycle if indeed there is no effect of smoking. Therefore, to 
find how many of the people in row 1 would be expected to be in column 1, simply 
take the overall proportion who are in column 1 and multiply it by the number of 
people who are in row 1. In other words, use 


expected count in first row and first column = (row 1 total)(column 1 total)/(table total) 


EXAMPLE 3 CONTINUED Expected Counts if Smoking and Ease 
of Pregnancy Are Not Related 


Let’s begin by computing the expected number of smokers achieving pregnancy in the 
first cycle, assuming smoking does not affect ease of pregnancy: 

expected count for row 1 and column 1 = (100)(227)/586 = 38.74 


It’s very important that the numbers not be rounded off at this stage. Now that we have 
the expected count for the first row and column, we can fill in the rest of the expected 
counts (see Table 13.2), making sure the row and column totals remain the same as they 
were in the original table. 
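
For readers who prefer to check the arithmetic by computer, here is a minimal Python sketch of the expected-count formula applied to every cell of Table 13.1. The function name `expected_counts` and the list-of-rows layout are our own choices, not part of the method itself.

```python
def expected_counts(observed):
    """Expected counts for a table given as a list of rows, using
    (row total)(column total)/(table total) for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)
    return [[r * c / table_total for c in col_totals] for r in row_totals]

# Table 13.1: rows are smoker, nonsmoker;
# columns are first cycle, two or more cycles
observed = [[29, 71], [198, 288]]
expected = expected_counts(observed)

for row in expected:
    # Round only for display; the unrounded values are kept in `expected`
    print([round(cell, 2) for cell in row])
```

Computing all four cells from the formula, rather than one cell plus subtraction, gives the same answers and also works for larger tables.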


Compare the Observed and Expected Counts 

For this step, we compute the difference between what we would expect by chance 
(the “expected counts”) and what we have actually observed. To remove negative 
signs and to standardize these differences based on the number in each combination, 
we compute the following for each of the cells of the table: 


(observed count − expected count)² / (expected count) 


In a 2 X 2 table, the numerator will actually be the same for each cell. In contin- 
gency tables with more than two rows or columns (or both), this would not be the 
case. 


Table 13.2 Computing the Expected Counts for Table 13.1 

                       Pregnancy Occurred After 
             First Cycle                Two or More Cycles         Total 
Smoker       (100)(227)/586 = 38.74     100 − 38.74 = 61.26        100 
Nonsmoker    227 − 38.74 = 188.26       486 − 188.26 = 297.74      486 
Total        227                        359                        586 


EXAMPLE 3 CONTINUED Comparing Observed and Expected Numbers of Pregnancies 

The denominator for the first cell is 38.74 and the numerator is 

(observed count − expected count)² = (29 − 38.74)² = (−9.74)² = 94.87 

Convince yourself that this same numerator applies for the other three cells. The 
contribution for each cell is shown in Table 13.3. 


Table 13.3 Comparing the Observed and Expected Counts for Table 13.1 

             First Cycle                Two or More Cycles 
Smoker       94.87/38.74 = 2.45         94.87/61.26 = 1.55 
Nonsmoker    94.87/188.26 = 0.50        94.87/297.74 = 0.32 


Compute the Chi-Square Statistic 

To compute the chi-square statistic, simply add the numbers in all of the cells from 
step 2. The result is the chi-square statistic. 


EXAMPLE 3 CONTINUED The Chi-Square Statistic for Comparing 
Smokers and Nonsmokers 

chi-square statistic = 2.45 + 1.55 + 0.50 + 0.32 = 4.82 
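
The three steps can be collected into one short function. The sketch below is an illustration under our own naming (`chi_square_statistic` is a hypothetical helper, not a library routine); it simply adds up (observed − expected)² / expected over all the cells of Table 13.1.

```python
def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells,
    where expected = (row total)(column total)/(table total)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / table_total
            total += (obs - exp) ** 2 / exp
    return total

# Table 13.1: rows are smoker, nonsmoker;
# columns are first cycle, two or more cycles
chi_sq = chi_square_statistic([[29, 71], [198, 288]])
print(round(chi_sq, 2))
```

Because the function keeps full precision until the end, its result should agree with the hand computation up to rounding.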


Making the Decision 

Let’s revisit the rationale for the decision. Remember that for a 2 X 2 table, the re- 
lationship earns the title of “statistically significant” if the chi-square statistic is at 
least 3.84. The origin of the “magic” number 3.84 is too technical to describe here. 
It comes from a table of percentiles representing what should happen by chance, 
similar to the percentile table for z-scores that was included in Chapter 8. For larger 
contingency tables, you would need to look up the appropriate number in a table 
called “percentiles of the chi-square distribution,” which is found in most statistics 
books. Many calculators and computer applications such as Excel can also provide 
these numbers. 

The interpretation of the value 3.84 is straightforward. Of all 2 X 2 tables for 
sample data from populations in which there is no relationship, 95% of the tables 
will have a chi-square statistic of 3.84 or less. Relationships in the sample that re- 
flect a real relationship in the population are likely to produce larger chi-square sta- 
tistics. Therefore, if we observe a relationship that has a chi-square statistic larger 
than 3.84, we can assume that the relationship in the sample did not occur by chance. 
In that case, we say that the relationship is statistically significant. Of all relation- 
ships that have occurred just by chance, 5% of them will erroneously earn the title 
of statistically significant. 

It is also possible to miss a real relationship. In other words, it’s possible that a 
sample from a population with a real relationship will result in a chi-square statistic 
that’s less than the magic 3.84. As will be described later in this chapter, this is most 
likely to happen if the size of the sample is too small. In that case, a real relationship 
may not be detected as statistically significant. Remember that the chi-square statis- 
tic depends on both the strength of the relationship and the size of the sample. 
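
One way to make the 95% interpretation concrete is to simulate many samples from a population in which there is no relationship and count how often the chi-square statistic exceeds 3.84. The Python sketch below is our own illustration: the group sizes (100 and 486) and the common success rate (0.387) are borrowed from Table 13.1, while the number of repetitions and the random seed are arbitrary assumptions.

```python
import random

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total

def fraction_significant(n1, n2, p, reps, seed=1):
    """Simulate tables in which both groups share the same success
    probability p, and return the fraction with chi-square > 3.84."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        s1 = sum(rng.random() < p for _ in range(n1))
        s2 = sum(rng.random() < p for _ in range(n2))
        if chi_square_2x2(s1, n1 - s1, s2, n2 - s2) > 3.84:
            hits += 1
    return hits / reps

frac = fraction_significant(n1=100, n2=486, p=0.387, reps=2000)
print(frac)  # should land near 0.05; the exact value depends on the seed
```

Every table counted here is a false alarm, since the simulated population has no relationship at all; the fraction should hover around 5%, which is the error rate described above.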
Figure 13.1 Minitab (Version 13) Results for Example 3 on Smoking and Pregnancy 

Expected counts are printed below observed counts 

             First Cy   Two or M    Total 
Smoker             29         71      100 
                38.74      61.26 
Nonsmoker         198        288      486 
               188.26     297.74 
Total             227        359      586 

Chi-Sq = 2.448 + 1.548 + 
         0.504 + 0.318 = 4.817 
DF = 1, P-Value = 0.028 


Deciding if Smoking Affects Ease of Pregnancy 

The chi-square statistic computed for the example is 4.82, which is larger than 3.84. 
Thus, we can say there is a statistically significant relationship between smoking and 
time to pregnancy. In other words, we can conclude that the difference we observed in 
time to pregnancy between smokers and nonsmokers in the sample indicates a real dif- 
ference for the population of all similar women. It was not just the luck of the draw for 
this sample. This result is based on the assumption that the women studied can be 
considered to be a random sample from that population. 


Computers, Calculators, and Chi-Square Tests 


Many simple computer programs and graphing calculators will compute the chi- 
square statistic for you. Figure 13.1 shows the results of using a statistical comput- 
ing program called Minitab to carry out the example we have just computed by hand. 
Notice that the computer has done all the work for us and presented it in summary 
form. The original, observed counts are displayed first for each cell, and the expected 
counts are displayed below them. After the table of observed and expected counts, 
the chi-square statistic is computed, showing the contribution for each cell in the 
same format as the table. Finally, the p-value is presented. 

The only thing the computer did not supply is the decision. But it tells us that the 
chi-square statistic is 4.817 and the p-value is 0.028. Based on that information, we 
can reach our own conclusion that the relationship is statistically significant. 

Microsoft Excel will provide the p-value for you once you know the chi-square 
statistic. The function is CHIDIST(x,df) where “x” is the value of the chi-square sta- 
tistic and “df” is short for “degrees of freedom.” In general, for a chi-square test 
based on a table with r rows and c columns (not counting the “totals” row and col- 
umn), the degrees of freedom = (r — 1)(c — 1). When there are 2 rows and 2 
columns, df = 1. As an illustration, to find the p-value for the chi-square statistic of 
4.82 for Example 3, use CHIDIST(4.82,1). Excel returns the value 0.028131348, or 
about 0.028, the same value provided by Minitab. This tells us that in about 2.8% of 
all samples from populations in which there is no relationship, the chi-square statistic 
will be 4.82 or larger just by chance. Again, for our example we use this as evidence 
that the sample didn’t come from a population with no relationship; it came 
from one where the two variables of interest (smoking and ease of pregnancy) are 
related. 
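
If Excel is not at hand, the same p-value can be obtained in Python for the 2 X 2 case. The sketch below relies on a standard fact that goes beyond the text: with 1 degree of freedom, a chi-square variable is the square of a standard normal variable, so its upper-tail probability can be written with the complementary error function. The function name `chi_square_p_value_df1` is ours, and the shortcut applies only when df = 1.

```python
import math

def chi_square_p_value_df1(x):
    """P(chi-square with 1 df >= x). Valid only for df = 1, where the
    chi-square variable is the square of a standard normal variable."""
    return math.erfc(math.sqrt(x / 2))

p = chi_square_p_value_df1(4.82)
print(round(p, 3))  # compare with the value reported by Excel and Minitab

# The cutoff 3.84 corresponds to a p-value of about 0.05
print(round(chi_square_p_value_df1(3.84), 3))
```

For tables with more than two rows or columns, a different calculation is needed, so this sketch should not be reused there.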


EXAMPLE 4 Age at Birth of First Child and Breast Cancer 


In Example 4 of Chapter 12 we presented data from a study of the relationship between 
age at which a woman gave birth to her first child and subsequent occurrence of breast 
cancer. The relative risk of breast cancer was 1.33, with women having their first child 
at age 25 or older having greater risk. The study was based on a sample of over 6000 
women, and the results are shown again in Table 13.4. Is the relationship statistically sig- 
nificant? Let’s go through the four steps of testing this hypothesis. 


Step 1: Determine the null hypothesis and the alternative hypothesis. 


Null hypothesis: There is no relationship between age at birth of first child and breast 
cancer in the population of women who have had children. 


Alternative hypothesis: There is a relationship between age at birth of first child and 
breast cancer in the population of women who have had children. 


Step 2: Collect the data and summarize them with a single number called a test statistic. 

Expected count for “Yes and Breast Cancer” = (1628)(96)/6168 = 25.34. By 
subtraction, the other expected counts can be found as shown in Table 13.5. Therefore, the 
chi-square statistic is 

(31 − 25.34)²/25.34 + (1597 − 1602.66)²/1602.66 + (65 − 70.66)²/70.66 + (4475 − 4469.34)²/4469.34 = 1.75 


Table 13.4 Age at Birth of First Child and Breast Cancer 

First Child at Age 25 or Older?   Breast Cancer   No Breast Cancer   Total 
Yes                                     31              1597          1628 
No                                      65              4475          4540 
Total                                   96              6072          6168 

Source: Pagano and Gauvreau (1993). 


Table 13.5 Expected Counts for Age at Birth of First Child 
and Breast Cancer 

First Child at Age 25 or Older?   Breast Cancer              No Breast Cancer           Total 
Yes                               (1628)(96)/6168 = 25.34    1628 − 25.34 = 1602.66     1628 
No                                96 − 25.34 = 70.66         4540 − 70.66 = 4469.34     4540 
Total                             96                         6072                       6168 
Step 3: Determine how unlikely the test statistic would be if the null hypothesis were 
true. 

Using Excel, the p-value can be found as CHIDIST(1.75, 1) = 0.186. 


Step 4: Make a decision. 

Because the chi-square statistic is less than 3.84 (and the p-value of 0.186 is greater 
than 0.05), the relationship is not statistically significant, and we cannot conclude that the 
increased risk observed in the sample would hold for the population of women. The re- 
lationship could simply be due to the luck of the draw for this sample. The relative risk 
in the population may be 1.0, meaning that both groups are at equal risk for develop- 
ing breast cancer. Remember that even if the null hypothesis had been rejected, we 
would not have been able to conclude that delaying childbirth causes breast cancer. Ob- 
viously, the data are from an observational study, because women cannot be randomly 
assigned to have children at a certain age. Therefore, there are possible confounding 
variables, such as use of oral contraceptives at a young age, that may be related to age 
at birth of first child and may have an effect on likelihood of breast cancer. 
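
Steps 2 through 4 for this example can be reproduced with a short Python sketch. The helper names below are ours, and the p-value shortcut assumes 1 degree of freedom, which holds for any 2 X 2 table; it relies on the standard fact that a chi-square variable with 1 df is a squared standard normal variable.

```python
import math

def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / table_total) ** 2
        / (row_totals[i] * col_totals[j] / table_total)
        for i, row in enumerate(observed)
        for j, obs in enumerate(row)
    )

def p_value_df1(x):
    """Upper-tail probability for a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(x / 2))

# Table 13.4: rows are first child at age 25 or older (yes, no);
# columns are breast cancer, no breast cancer
table_13_4 = [[31, 1597], [65, 4475]]

chi_sq = chi_square_statistic(table_13_4)            # Step 2
p = p_value_df1(chi_sq)                              # Step 3
significant = chi_sq >= 3.84                         # Step 4
print(round(chi_sq, 2), round(p, 3), significant)
```

The p-value computed from the unrounded statistic may differ from the Excel result in the third or fourth decimal place, because Excel was given the rounded value 1.75.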


13.4 Practical versus Statistical Significance 


You should be aware that “statistical significance” does not mean the two variables 
have a relationship that you would necessarily consider to be of practical impor- 
tance. For example, a table based on a very large number of observations will have 
little trouble achieving statistical significance, even if the relationship between the 
two variables is only minor. Conversely, an interesting relationship in a population 
may fail to achieve statistical significance in the sample if there are only a few ob- 
servations. It is difficult to rule out chance unless you have either a very strong rela- 
tionship or a sufficiently large sample. 

To see this, consider the relationship between taking aspirin instead of a placebo 
and having a heart attack or not. The chi-square statistic, based on the result from the 
22,071 participants in the study, is 25.01, so the relationship is clearly statistically 
significant. Now suppose there were only one-tenth as many participants, or 2207— 
still a fair-sized sample. Further suppose that the heart attack rates remained the 
same, at 9.4 per thousand for the aspirin group and 17.1 per thousand for the placebo 
group. What would happen then? 

If you look at the method for computing the chi-square statistic, you will realize 
that if all numbers in the study are divided by 10, the resulting chi-square statistic is 
also divided by 10. This is because the numerator of the contribution for each cell is 
squared, but the denominator is not. Therefore, if the aspirin study had only 2207 
participants instead of 22,071, the chi-square statistic would have been only 2.501 
(25.01/10). It would not have been large enough to conclude that the relationship be- 
tween heart attacks and aspirin consumption was statistically significant, even 
though the rates of heart attacks per thousand people were still 9.4 for the aspirin 
group and 17.1 for the placebo group. 
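
The effect of dividing every count by 10 can be checked directly. In the Python sketch below, the aspirin counts come from Example 1; the helper name is ours, and fractional counts are used purely as a device to keep the heart attack rates exactly the same in the smaller hypothetical study.

```python
def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / table_total
            total += (obs - exp) ** 2 / exp
    return total

# Example 1 counts: rows are aspirin, placebo;
# columns are heart attack, no heart attack
aspirin_study = [[104, 11037 - 104], [189, 11034 - 189]]
full = chi_square_statistic(aspirin_study)

# Hypothetical study one-tenth the size with identical rates
# (fractional "counts" are an assumption made only for illustration)
tenth = chi_square_statistic([[cell / 10 for cell in row] for row in aspirin_study])

print(round(full, 2), round(tenth, 3))
```

The same rates produce a statistic above 3.84 in the full study and below 3.84 in the smaller one, which is the point of this section.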


Figure 13.2 Minitab Results for Example 2 on Drinking and Driving 


No Relationship versus No Statistically 
Significant Relationship 


Some researchers report the lack of a statistically significant result erroneously, by im- 
plying that a relationship must therefore not exist. When you hear the claim that a 
study “failed to find a relationship” or that “no relationship was found” between two 
variables, it does not mean that a relationship was not observed in the sample. It means 
that whatever relationship was observed did not achieve statistical significance. When 
you hear of such a result, always check to make sure the study was not based on a 
small number of individuals. If it was, remember that with a small sample, it takes a 
very strong relationship for it to earn the title of “statistical significance.” 

Be particularly wary if researchers report that no relationship was found, or that 
the proportions with a certain response are equal for the categories of the explana- 
tory variable in the population. In general, they really mean that no statistically sig- 
nificant relationship was found. It is impossible to conclude, based on a sample, 
anything exact about the population. This is why we don’t say that we can accept the 
null hypothesis of no relationship in the population. 


Drinking and Driving 

Let's examine in more detail the evidence in Example 2 that was presented to the 
Supreme Court to see if we can rule out chance as an explanation for the higher per- 
centage of male drivers who had been drinking. In Figure 13.2, we present the results 
of asking the Minitab program to compute the chi-square statistic. Notice that the sum- 
mary statistic is only 1.637—which is not large enough to find a statistically significant 
difference in percentages of males and females who had been drinking. You can see 
why the Supreme Court was reluctant to conclude that the difference in the sample rep- 
resented sufficient evidence for a real difference in the population. 

Expected counts are printed below observed counts

            Yes       No    Total
Male         77      404      481
          72.27   408.73

Female       16      122      138
          20.73   117.27

Total        93      526      619

Chi-Sq = 0.310 + 0.055 +
         1.081 + 0.191 = 1.637
DF = 1, P-Value = 0.201


This example provides a good illustration of the distinction between statistical and
practical significance and how it relates to the size of the sample. You might think that
a real relationship for the population is indicated by the fact that 16% of the males but
only 11.6% of the females in the sample had been drinking. But the chi-square test tells
us that a difference of that magnitude in a sample of this size would not be at all
surprising, if in fact equal proportions of males and females in the population had been
drinking. If those same percents were found in a much larger sample, then the evidence
would be convincing. Notice that if the sample were three times as large, but the
percents drinking remained at 16% and 11.6%, then the chi-square statistic would be
(3)(1.637) = 4.91, and the difference would indeed be statistically significant.
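
For readers who want to see where a p-value such as the one reported by Minitab comes from, here is a small sketch. It relies on a standard fact that is not derived in this chapter: with 1 degree of freedom, the chi-square statistic behaves like the square of a standard normal score, so its upper-tail probability can be written with the complementary error function. The function name `p_value_df1` is made up for this illustration, and the formula applies only to tables with 1 degree of freedom, such as 2 × 2 tables.

```python
import math


def p_value_df1(chi_square_stat):
    """Upper-tail probability for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square_stat / 2))


observed_stat = 1.637              # chi-square statistic for the drinking data
tripled_stat = 3 * observed_stat   # same percents, three times as many drivers

p_observed = p_value_df1(observed_stat)
p_tripled = p_value_df1(tripled_stat)
p_cutoff = p_value_df1(3.84)

print(f"p-value for the actual sample:  {p_observed:.3f}")
print(f"p-value for the tripled sample: {p_tripled:.3f}")
print(f"p-value at the 3.84 cutoff:     {p_cutoff:.3f}")
```

The last line shows why 3.84 serves as the cutoff: a statistic of that size corresponds to a p-value of about 0.05.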


Extrasensory Perception Works Best with Movies 


Extrasensory perception (ESP) is the apparent ability to obtain information in ways 
that exclude ordinary sensory channels. Early laboratory research studying ESP fo- 
cused on having people try to guess at simple targets, such as symbols on cards, to 
see if the subjects could guess at a better rate than would be expected by chance. In 
recent years, experimenters have used more interesting targets, such as photographs, 
outdoor scenes, or short movie segments. 

In a study of ESP reported by Bem and Honorton (January 1994), subjects 
(called receivers) were asked to describe what another person (the sender) had just 
watched on a television screen in another room. The receivers were shown four pos- 
sible choices and asked to pick which one they thought had been viewed by the 
sender in the other room. Because the actual target was randomly selected from 
among the four choices, the guesses should have been successful by chance 25% of 
the time. Surprisingly, they were actually successful 34% of the time. 

For this case study, we are going to examine a categorical variable that was in- 
volved and ask whether the results were affected by it. The researchers had hypoth- 
esized that moving pictures might be received with better success than ordinary 
photographs. To test that theory, they had the sender sometimes look at a single, “sta- 
tic” image on the television screen and sometimes look at a “dynamic” short video 
clip, played repeatedly. The additional three choices shown to the receiver for judg- 
ing (the “decoys”) were always of the same type (static or dynamic) as the actual tar- 
get, to eliminate biases due to a preference for one over the other. The question of 
interest was whether the success rate changed based on the type of picture. The re- 
sults are shown in Table 13.6. 

Table 13.6 Results of ESP Study

                      Successful ESP Guess?
                     Yes      No    Total   % Success
Static picture        45     119      164      27%
Dynamic picture       77     113      190      41%
Total                122     232      354      34%

Source: Bem and Honorton, 1994.


Figure 13.3 shows the results from the Minitab program. Notice that the chi-square
statistic is 6.675; this far exceeds the value of 3.84 required to declare the
relationship statistically significant. The p-value is 0.010. Therefore, it does appear
that success in ESP guessing depends on the type of picture used as the target. You
can see that guesses for the static pictures were almost at chance (27% compared to
25% expected by chance), whereas the guesses for the dynamic videos far exceeded
what was expected by chance (41% compared to 25%).


Figure 13.3 Minitab results for Case Study 13.1

Expected counts are printed below observed counts

             Yes       No    Total
Static        45      119      164
           56.52   107.48

Dynamic       77      113      190
           65.48   124.52

Total        122      232      354

Chi-Sq = 2.348 + 1.235 +
         2.027 + 1.066 = 6.675
DF = 1, P-Value = 0.010


For Those Who Like Formulas 


To represent the observed counts in a 2 × 2 contingency table, we use the notation


                   Variable 2
Variable 1     Yes       No        Total
Yes            a         b         a + b
No             c         d         c + d
Total          a + c     b + d     n


Therefore, the expected counts are computed as follows:


                   Variable 2
Variable 1     Yes                 No                  Total
Yes            (a + b)(a + c)/n    (a + b)(b + d)/n    a + b
No             (c + d)(a + c)/n    (c + d)(b + d)/n    c + d
Total          a + c               b + d               n


258 PART2 Finding Life in Data 


Computing the Chi-Square Statistic, $\chi^2$, for an r × c Contingency Table
Let $O_i$ = observed count in cell i and $E_i$ = expected count in cell i, where i = 1, 2,
..., r × c. Then,

\[
\chi^2 = \sum_{i=1}^{r \times c} \frac{(O_i - E_i)^2}{E_i}
\]
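

The formula can be followed cell by cell in a few lines of code. The sketch below is only an illustration, with names such as `expected_counts` and `contributions` invented here. It uses the ESP counts from Table 13.6 and prints the expected counts and the four contributions in the same order that Minitab lists them.

```python
def expected_counts(table):
    """Expected count for each cell: (row total)(column total)/n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def contributions(table):
    """Each cell's contribution to chi-square: (O - E)^2 / E."""
    expected = expected_counts(table)
    return [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


# Rows: static picture, dynamic picture. Columns: successful guess yes, no.
esp_table = [[45, 119],
             [77, 113]]

esp_expected = expected_counts(esp_table)
esp_contributions = contributions(esp_table)
esp_chi_square = sum(sum(row) for row in esp_contributions)

for label, exp_row, con_row in zip(["Static", "Dynamic"], esp_expected, esp_contributions):
    print(label,
          "expected:", [round(e, 2) for e in exp_row],
          "contributions:", [round(c, 3) for c in con_row])
print("Chi-square statistic:", round(esp_chi_square, 3))
```

The same two functions work for any r × c table, because nothing in them assumes two rows or two columns.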


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. If there is a relationship between two variables in a population, which is more 
likely to result in a statistically significant relationship in a sample from that 
population—a small sample, a large sample, or are they equivalent? Explain. 


*2. If there is no relationship between two variables in a population, which is more 
likely to result in a statistically significant relationship in a sample—a small 
sample, a large sample, or are they equivalent? Explain. (Hint: If there is no re- 
lationship in the population, how often will the chi-square statistic be 3.84 or 
greater? Does it depend on the size of the sample?) 


3. Suppose a relationship between two variables is found to be statistically signif- 
icant. Explain whether each of the following is true in that case: 


a. There is definitely a relationship between the two variables in the sample. 
b. There is definitely a relationship between the two variables in the population. 


c. It is likely that there is a relationship between the two variables in the 
population. 


4. Are null and alternative hypotheses statements about samples, about popula- 
tions, or does it depend on the situation? Explain. 


*5. Explain what “expected counts” represent. In other words, under what condition
are they “expected”? 


6. For each of the following possible conclusions, state whether it would follow 
when the p-value is less than 0.05: 


a. Reject the null hypothesis.
b. Reject the alternative hypothesis.
c. Accept the null hypothesis.
d. Accept the alternative hypothesis.
e. The relationship is not statistically significant.
f. The relationship is statistically significant.


7. For each of the following possible conclusions, state whether it would follow 
when the p-value is greater than 0.05: 


a. Reject the null hypothesis. 
b. Reject the alternative hypothesis. 
c. Accept the null hypothesis.

d. Accept the alternative hypothesis. 

e. The relationship is not statistically significant. 
f. The relationship is statistically significant. 


*8. For each of the following chi-square statistics or p-values based on a chi-square
test for a 2 × 2 table, would the relationship be statistically significant?


*a. chi-square statistic = 1.42
b. chi-square statistic = 4.5 
*c. p-value = 0.01 
d. p-value = 0.15 
9. In each of the following situations, specify the population. Also, state the two 
categorical variables that would be measured for each unit in the sample and the 
two categories for each variable. 

a. Researchers want to know if there is a relationship between having gradu- 
ated from college or not and voting in the last presidential election, for all 
registered voters over age 25. 

b. Researchers want to know if there is a relationship between smoking and di- 
vorce for people who were married during the 1970s and 1980s. 

c. Researchers classify midsize cities in the United States according to whether 
the city’s median family income is higher or lower than the median family 
income for the state in which the city is located. They want to know if there 
is a relationship between that classification and proximity of the city to a ma- 
jor airport, defined as whether or not one of the 30 busiest airports in the 
country is within 50 miles of the city. 

10. Refer to the previous exercise. In each case, state the null and alternative 
hypotheses. 
*11. A political poll based on a random sample of 1000 likely voters classified them
by sex and asked them if they planned to vote for Candidate A or Candidate B
in the upcoming election. Results are shown in the accompanying table.


*a. State the null and alternative hypotheses in this situation. 
b. Calculate the expected counts. 


c. Explain in words the rationale for the expected counts in the context of this 
example. 


*d. Calculate the value of the chi-square statistic. 


e. Make a conclusion. State the conclusion in the context of this situation. 


Candidate A Candidate B Total 


Male 200 250 450 
Female 300 250 550 
Total 500 500 1000 
Figure 13.4 Minitab results for the Roper poll on seeing a ghost


Expected counts are printed below observed counts

              Yes       No    Total
18 to 29      212     1313     1525
           174.93  1350.07

over 29       465     3912     4377
           502.07  3874.93

Total         677     5225     5902

Chi-Sq = 7.857 + 1.018 +
         2.737 + 0.355 = 11.967


12. Refer to Example 1, investigating the relationship between taking aspirin and
risk of heart attack. As shown in Table 12.1 on page 219, 104 of the 11,037
aspirin takers had heart attacks, whereas 189 of the 11,034 placebo takers had
them. Carry out the chi-square test for this study. (The hypotheses are already
given in this chapter.)


13. In Exercise 9 of Chapter 12 results were given for a Roper Poll in which people
were classified according to age and were asked if they had ever seen a ghost.
The results from asking Minitab to compute the chi-square statistic are shown
in Figure 13.4. What can you conclude about the relationship between age group
and reportedly seeing a ghost?


14. This is a continuation of Exercise 15 in Chapter 12. A case-control study in
Berlin, reported by Kohlmeier et al. (1992) and by Hand et al. (1994), asked 239
lung cancer patients and 429 controls (matched to the cases by age and sex)
whether they had kept a pet bird during adulthood. Of the 239 lung cancer cases,
98 said yes. Of the 429 controls, 101 said yes. Compute the chi-square statistic
and assess the statistical significance for the relationship between bird
ownership and lung cancer.


*15. Howell (1992, p. 153) reports on a study by Latané and Dabbs (1975) in which
a researcher entered an elevator and dropped a handful of pencils, with the
appearance that it was an accident. The question was whether the males or females
who observed this mishap would be more likely to help pick up the pencils. The
results are shown in the table at the top of the next page.


*a. Compute and compare the proportions of males and females who helped 
pick up the pencils. 

*b. Compute the chi-square statistic and use it to determine whether there is a 
statistically significant relationship between the two variables in the table. 
Explain your result in a way that could be understood by someone who 
knows nothing about statistics. 


*c. Would the conclusion in part b have been the same if only 262 people had
been observed but the pattern of results was the same? Explain how you 
reached your answer and what it implies about research of this type. 


Helped Pick Up Pencils?


Yes No Total 


Male observer 370 950 1320 
Female observer 300 1003 1303 
Total 670 1953 2623 


Data Source: Howell, 1992, p. 154, from a study by Latané 
and Dabbs, 1975. 


16. This is a continuation of Exercise 16 in Chapter 12. The data (shown in the ac- 
companying table) are reproduced from Case Study 12.1 and represent employ- 
ees laid off by the U.S. Department of Labor. 


Ethnic Group Laid Off Not Laid Off Total 
African American 130 1382 1512 
White 87 2813 2900 
Total 217 4195 4412 


Data Source: Gastwirth and Greenhouse, 1995. 


Minitab computed the chi-square statistic as 66.595. Explain what this means 
about the relationship between the two variables. Include an explanation that 
could be understood by someone with no knowledge of statistics. Make the as- 
sumption that these employees are representative of a larger population of 
employees. 


17. This is a continuation of Exercise 17 in Chapter 12. Kohler (1994, p. 427) re- 
ported data on the approval rates and ethnicity for mortgage applicants in Los 
Angeles in 1990. Of the 4096 African American applicants, 3117 were ap- 
proved. Of the 84,947 white applicants, 71,950 were approved. The chi-square 
statistic for these data is about 220, so the difference observed in the approval 
rates is clearly statistically significant. Now suppose that a random sample of 
890 applicants had been examined, a sample size 100 times smaller than the one 
reported. Further, suppose the pattern of results had been almost identical, re- 
sulting in 40 African American applicants with 30 approved, and 850 white ap- 
plicants with 720 approved. 


a. Construct a contingency table for these numbers. 
b. Compute the chi-square statistic for the table. 
c. Make a conclusion based on your result in part b and compare it with the
conclusion that would have been made using the full data set. Explain any
discrepancies and discuss their implications for this type of problem.


Exercises 18 to 21 are based on News Story 2, “Research shows women harder hit by 
hangovers” and the accompanying Original Source 2. In the study, 472 men and 758 
women, all of whom were college students and alcohol drinkers, were asked about 
whether they had experienced each of 13 hangover symptoms in the previous year. 




18. What population do you think is represented by the sample for this study? 
Explain. 


*19. One of the results in the Original Source was “there were only two symptoms 
that men experienced more often than women: vomiting (men: 50%; women: 
44%; chi-square statistic = 4.7, p = 0.031) and sweating more than usual (men: 
34%; women: 23%; chi-square statistic = 18.9, p < 0.001)” (Slutske, Piasecki, 
and Hunt-Carter, 2003, p. 1446). 


*a. State the null and alternative hypotheses for each of these two results. 


b. State the conclusion that would be made for each of the two results, both in 
statistical terms and in the context of the situation. 


20. One of the statements in the Original Source was “men and women were equally 
likely to experience at least one of the hangover symptoms in the past year 
(men: 89%; women: 87%; chi-square statistic = 1.2, p = 0.282)” (Slutske, Pi- 
asecki, and Hunt-Carter, 2003, p. 1445). 


a. State the null and alternative hypotheses for this result. 


b. Given what you have learned in this chapter about how to state conclusions, 
do you agree with the wording of the conclusion, that men and women were 
equally likely to experience at least one of the hangover symptoms in the 
past year? If so, explain how you reached that conclusion. If not, rewrite the 
conclusion using acceptable wording. 


*21. Participants were asked how many times in the past year they had experienced 
at least one of the 13 hangover symptoms listed. Responses were categorized as 
0 times, 1-2 times, 3-11 times, 12-51 times, and ≥ 52 times. For the purposes
of this exercise, responses have been categorized as less than an average of once 
a month (0-11 times) versus 12 or more times. The accompanying figure shows 
the Minitab computer output for frequency of symptoms categorized in this way 
versus the categorical variable male, female. We will determine if there is con- 
vincing evidence that one of the two sexes is more likely than the other to ex- 
perience hangover symptoms at least once a month, on average. 


a. State the null and alternative hypotheses being tested. 


Expected counts are printed below observed counts 


             ≤11      ≥12    Total
Male         326      140      466
          343.27   122.73

Female       569      180      749
          551.73   197.27

Total        895      320     1215

Chi-Sq = 0.869 + 2.429 +
         0.540 + 1.511 = 5.350
DF = 1, P-Value = 0.021


*b. Show how the expected count of 343.27 for the “Male, ≤ 11” category was
computed.

*c. Give the value of the chi-square statistic and the p-value, and make a con- 
clusion. State the conclusion in statistical terms and in the context of the 
situation. 


Exercises 22 to 24 are based on News Story 5, “Driving while distracted is common, 
researchers say,” and the accompanying Original Source 5, “Distractions in Every- 
day Driving.” 


22. Refer to Table 8 on page 37 of Original Source 5 on the CD. Notice that there
is a footnote to the table that reads “*p<.05 and **p<.01, based on chi-square
test of association with sex.” The footnote applies to “Grooming**” and
“External distraction*.”

a. Explain what null and alternative hypotheses are being tested for
“Grooming**.” (Notice that the definition of “grooming” is given on page 41 of the
report.)


b. Explain what the footnote means.


23. The accompanying figure provides Minitab computer output for testing for a
relationship between sex and “External distraction” but the expected counts have
been removed for you to fill in.

a. Fill in the expected counts. 

b. State the null and alternative hypotheses being tested. 

c. Give the value of the chi-square statistic and the p-value, and make a con- 
clusion. State the conclusion in statistical terms and in the context of the 
situation. 

d. Explain how this result confirms the footnote given for the table in connec- 
tion with “External distraction.” 


Expected counts are printed below observed counts 


Yes No Total 
Male 27 8 35 
Female 33 2 35 
Total 60 10 70 
Chi-Sq = 0.300 + 1.800 + 

0.300 + 1.800 = 4.200 
DF = 1, P-Value = 0.040 


*24. The table at the top of the next page shows participants categorized by sex and
by whether they were observed conversing (yes, no).
a. State the null and alternative hypotheses that can be tested with this table. 


*b. What is the expected count for “Male, Yes?” Find the remaining expected 
counts by subtraction. 
c. Compute the chi-square statistic.


d. Make a conclusion in statistical terms and in the context of the situation.


Conversing? 


Yes No Total 


Male 28 7 35 
Female 26 9 35 
Total 54 16 70 


Mini-Projects 


1. Carefully collect data cross-classified by two categorical variables for which
you are interested in determining whether there is a relationship. Do not get the
data from a book or journal; collect it yourself. Be sure to get counts of at least
five in each cell and be sure the individuals you use are not related to each other
in ways that would influence their data.

a. Create a contingency table for the data.

b. Compute and discuss the risks and relative risks. Are those terms appropriate
for your situation? Explain.

c. Determine whether there is a statistically significant relationship between the
two variables.

d. Discuss the role of sample size in making the determination in part c.

e. Write a summary of your findings.


2. Find a poll that has been conducted at two different time periods or by two
different sources. For instance, many polling organizations ask opinions about
certain issues on an annual or other regular basis.

a. Create a 2 × 2 table where “time period” is one categorical variable and
“response to poll” is the other, categorized into just two choices, such as “favor”
and “do not favor” for an opinion question.

b. State the null and alternative hypotheses for comparing responses across the
two time periods. Be sure you differentiate between samples and populations.

c. Carry out a chi-square test to see if opinions have changed over the two time
periods.

d. Write a few sentences giving your conclusion in words that someone with no
training in statistics could understand.


3. Find a journal article that uses a chi-square test.

a. State the hypotheses being tested.

b. Write the contingency table.

c. Give the value of the chi-square statistic and the p-value as reported in the
article.

d. Write a paragraph or more (as needed) explaining what was tested and what
was concluded, as if you were writing for a newspaper.
Reading the Economic News


THOUGHT QUESTIONS 


1. The Conference Board, a not-for-profit organization, produces a composite index of 
leading economic indicators as well as one of coincident and lagging economic indi- 
cators. These are supposed to “indicate” the status of the economy. What do you 
think the terms leading, coincident, and lagging mean in this context?


2. Suppose you wanted to measure the yearly change in the “cost of living” for a col- 
lege student living in the dorms for the past 4 years. How would you do it? 


3. Suppose you were told that the price of a certain product, measured in 1984 dollars, 
has not risen. What do you think is meant by “measured in 1984 dollars”? 


4. How do you think governments determine the “rate of inflation”? 
14.1 Cost of Living: The Consumer Price Index


Everyone is affected by inflation. When the costs of goods and services rise, most 
workers expect their employers to increase their salaries to compensate. But how do 
employers know what a fair salary adjustment would be? 

The most common measure of change in the cost of living in the United States 
is the Consumer Price Index (CPI), produced by the Bureau of Labor Statistics 
(BLS). The CPI was initiated during World War I, a time of rapidly increasing prices, 
to help determine salary adjustments in the shipbuilding industry. As noted by the 
BLS, the CPI is not a true cost-of-living index: 


Both the CPI and a cost-of-living index would reflect changes in the prices of 
goods and services, such as food and clothing, that are directly purchased in 
the marketplace; but a complete cost-of-living index would go beyond this to 
also take into account changes in other governmental or environmental factors 
that affect consumers’ well-being. (U.S. Dept. of Labor, 2003, CPI Web site) 


Nonetheless, the CPI is the best available measure of changes in the cost of living in 
the United States. 

The CPI measures changes in the cost of a “market basket” of goods and services 
that a typical consumer would be likely to purchase. The cost of that collection of 
goods and services is measured during a base period, then again at subsequent time 
periods. The CPI, at any given time period, is simply a comparison of current cost 
with cost during the base period. It is supposed to measure the changing cost of 
maintaining the same standard of living that existed during the base period. 

There are actually two different consumer price indexes, but we will focus on the 
one that is most widely quoted, the CPI-U, for all urban consumers. It is estimated 
that this CPI covers about 87% of all U.S. consumers (U.S. Dept. of Labor, 2003, 
CPI Web site). The CPI-U was introduced in 1978. The other CPI, the CPI-W, is a 
continuation of the original one from the early 1900s. It is based on a subset of 
households covered by the CPI-U, for which “more than one-half of the household’s 
income must come from clerical or wage occupations, and at least one of the house- 
hold’s earners must have been employed for at least 37 weeks during the previous 
12 months” (U.S. Dept. of Labor, 2003, CPI Web site). About 32% of the U.S. pop- 
ulation is covered by the CPI-W. 

To understand how the CPI is calculated, let’s first introduce the general concept 
of price index numbers. A price index for a given time period allows you to compare 
costs with another time period for which you also know the price index. 


Price Index Numbers 


A price index number measures prices (such as the cost of a loaf of bread) at one 
time period relative to another time period, usually as a percentage. For example, if 
a loaf of bread cost $1.00 in 1984 and $1.80 in 2004, then the bread price index 
would be ($1.80/$1.00) × 100 = 180%. In other words, bread in 2004 cost 180%
of what it cost in 1984. We could also say the price increased by 80%. 
Price index numbers are commonly computed on a collection of products instead
of just one. For example, we could compute a price index reflecting the increasing
cost of attending college. To define a price index number, decisions about the
following three components are necessary:

1. The base year or time period
2. The list of goods and services to be included
3. How to weight the particular goods and services


The general formula for computing a price index number is

price index number = (current cost/base time period cost) × 100


where “cost” is the weighted cost of the listed goods and services. Weights are
usually determined by the relative quantities of each item purchased during
the base period.


EXAMPLE 1

A College Index Number


Suppose a senior graduating from college wanted to determine by how much the cost 
of attending college had increased for each of the 4 years she was a student. Here is 
how she might specify the three components: 


1. Use her first year as a base. 


2. Include yearly tuition and fees, yearly cost of room and board, and yearly average cost 
of books and related materials. 


3. Weight everything equally because the typical student would be required to “buy”
one of each category per year.

Table 14.1 illustrates how the calculation would proceed. Note that we use the formula:

college index number = (current year total/first year total) × 100


Notice that the index for her senior year (listed in Table 14.1) is 121. This means
that these components of a college education in her senior year cost 121% of what
they cost in her first year. Equivalently, they have increased 21% since she started
college.


Table 14.1 Cost of Attending College

Year         Tuition   Room and Board   Books   Total     College Index
First        $3,000    $4,900           $700    $8,600    100
Sophomore    $3,200    $5,200           $720    $9,120    (9,120/8,600) × 100 = 106
Junior       $3,500    $5,400           $750    $9,650    (9,650/8,600) × 100 = 112
Senior       $4,000    $5,600           $800    $10,400   (10,400/8,600) × 100 = 121
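

The arithmetic in Table 14.1 is simple enough to reproduce in a few lines. This is only a sketch of the calculation, with the function name `price_index` and the dictionary layout chosen here for illustration. It weights tuition, room and board, and books equally, as the student in the example decided to do.

```python
def price_index(current_cost, base_cost):
    """Price index number = (current cost / base time period cost) x 100."""
    return current_cost / base_cost * 100


# Yearly costs from Table 14.1: (tuition, room and board, books).
college_costs = {
    "First": (3000, 4900, 700),
    "Sophomore": (3200, 5200, 720),
    "Junior": (3500, 5400, 750),
    "Senior": (4000, 5600, 800),
}

# Equal weighting: the yearly "cost" is just the sum of the three items.
totals = {year: sum(items) for year, items in college_costs.items()}
base_total = totals["First"]  # the first year serves as the base period

college_index = {
    year: round(price_index(total, base_total)) for year, total in totals.items()
}

for year in college_costs:
    print(f"{year:<10} total = ${totals[year]:,}  index = {college_index[year]}")
```

Choosing a different base year would change every index number but not the relative increase between any two years.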


CHAPTER 14 Reading the Economic News 269 


The Components of the Consumer Price Index 


The Base Year (or Years) 

The base year (or years) for the CPI changes periodically, partly so that the index 
does not get ridiculously large. If the original base year of 1913 were still used, the 
CPI would be well over 1000 and would be difficult to interpret. Since 1988, and 
continuing as of January 2004, the base period in use has been the years 1982-1984. The
previous base was the year 1967. Prior to that time, the base period was changed 
about once a decade. In December 1996, the Bureau of Labor Statistics announced 
that, beginning with the January 1999 CPI, the base period would change to 
1993-1995. Since it made that announcement, however, the BLS has decided that it 
will retain the 1982-1984 base (U.S. Dept. of Labor, 15 June 1998). 


The Goods and Services Included 
As is the case with the base year(s), the market basket of goods and services is
updated about once every 10 years. Items are added, deleted, and reorganized to
represent current buying patterns. In 1998, a major revision occurred that included the
addition of a new category called “Education and communication.” The market
basket now in use consists of over 200 types of goods and services. It was established
primarily based on the 1993-1995 Consumer Expenditure Survey, in which a
multistage sampling plan was used to select families who reported their expenditures.
That expenditure information, from about 30,000 individuals and families, was then
used to determine the items included in the index.

The market basket includes most things that would be routinely purchased. 
These are divided into eight major categories, each of which is subdivided into vary- 
ing numbers of smaller categories. The eight major categories are 


1. Food and beverages
2. Housing
3. Apparel
4. Transportation
5. Medical care
6. Recreation
7. Education and communication (the new category added in 1998)
8. Other goods and services


As noted, these categories are broken down into smaller ones. For example, here is 
the breakdown leading to the item “Ice cream and related products”: 


Food and beverages > Food at home > Dairy and related products 
> Ice cream and related products 


Relative Quantities of Particular Goods and Services 

Because consumers spend more on some items than on others, it makes sense to 
weight those items more heavily in the CPI. The weights assigned to each item are 
the relative quantities spent, as determined by the Consumer Expenditure Survey. 
The same weights are used on an ongoing basis and are updated occasionally, just as
are the base year and the market basket. Here are the relative weights in effect in
December 2002 for the eight categories, rounded to one decimal place:


1. Food and beverages 15.6% 
2. Housing 40.9% 
3. Apparel 4.2% 
4. Transportation 17.3% 
5. Medical care 6.0% 
6. Recreation 5.9% 
7. Education and communication 5.8% 
8. Other goods and services 4.3% 


Total 100% 


You can see that housing is by far the most heavily weighted category. This 
makes sense, especially because costs associated with diverse items such as utilities 
and furnishings are included under the general heading of housing. 
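
To get a feel for what the weights do, consider the following simplified sketch. It is not the procedure used by the Bureau of Labor Statistics. It simply combines one price index per category into an overall index by taking a weighted average, using the December 2002 weights listed above. The category index values (100 everywhere except for one category set to 110) are hypothetical numbers invented for the illustration, as is the function name `weighted_index`.

```python
# Relative weights (percent) in effect in December 2002, from the list above.
cpi_weights = {
    "Food and beverages": 15.6,
    "Housing": 40.9,
    "Apparel": 4.2,
    "Transportation": 17.3,
    "Medical care": 6.0,
    "Recreation": 5.9,
    "Education and communication": 5.8,
    "Other goods and services": 4.3,
}


def weighted_index(category_indexes, weights):
    """Weighted average of category indexes (a simplification, not the BLS method)."""
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * category_indexes[name] for name in weights)
    return weighted_sum / total_weight


def index_if_only_one_rises(category, new_index=110.0):
    """Hypothetical scenario: one category's index rises, all others stay at 100."""
    indexes = {name: 100.0 for name in cpi_weights}
    indexes[category] = new_index
    return weighted_index(indexes, cpi_weights)


housing_scenario = index_if_only_one_rises("Housing")
apparel_scenario = index_if_only_one_rises("Apparel")

print(f"10% rise in housing only: overall index = {housing_scenario:.2f}")
print(f"10% rise in apparel only: overall index = {apparel_scenario:.2f}")
```

Under these made-up conditions, the same 10% price increase moves the overall index almost ten times as far when it occurs in housing as when it occurs in apparel, because housing carries almost ten times the weight.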


Obtaining the Data for the CPI 


It is, of course, not possible to actually measure the average price of items paid by 
all families. The CPI is composed of samples taken from around the United States. 
Each month the sampling occurs at about 23,000 retail and service establishments in 
87 urban areas, and prices are measured on about 80,000 items. Rents are measured 
from about 50,000 landlords and tenants. 

Obviously, determining the Consumer Price Index and trying to keep it current 
represent a major investment of government time and money. We now examine ways 
in which the index is used, as well as some of its problems. 


14.2 Uses of the Consumer Price Index 


Most Americans, whether they realize it or not, are affected by the Consumer Price 
Index. It is the most widely used measure of inflation in the United States. 


Major Uses of the Consumer Price Index 


There are four major uses of the Consumer Price Index: 


1. The CPI is used to evaluate and determine economic policy.
2. The CPI is used to compare prices in different years.
3. The CPI is used to adjust other economic data for inflation.
4. The CPI is used to determine salary and price adjustments.

1. The CPI is used to evaluate and determine economic policy. As a measure of inflation,
the CPI is of interest to the president, Congress, the Federal Reserve Board,
private companies, and individuals. Government officials use it to evaluate how well 
current economic policies are working. Private companies and individuals also use 
it to make economic decisions. 


2. The CPI is used to compare prices in different years. If you bought a new car in 
1983 for $10,000, what would you have paid in 1991 for a similar quality car, using 
the CPI to adjust for inflation? The CPI in 1983 was very close to 100 (depending 
on the month), and the average CPI in 1991 was 136.2. Therefore, an equivalent 
price in 1991 would be about $10,000 × (136.2/100) = $13,620. The general formula
for determining the comparable price for two different time periods is


price at time 2 = (price at time 1) × [(CPI at time 2)/(CPI at time 1)]


For this formula to work, all CPIs must be adjusted to the same base period. When 
the base period is updated, past CPIs are all adjusted to the new period. Thus, the 
CPIs of years that precede the current base period are generally less than 100; those 
of years that follow the current base period are generally over 100. Here are some 
CPIs since 1950, using the 1982-1984 base period: 


Year 1950 1960 1970 1975 1980 1985 1990 1995 2000 
CPI 24.1 29.6 38.8 53.8 82.4 107.6 130.7 152.4 172.2 


Example: If the rent for a particular apartment was $400 a month in 1990, what 
would the comparable rent be in 2000? 


price in 2000 = (price in 1990) × [(CPI in 2000)/(CPI in 1990)]
price in 2000 = ($400) × (172.2/130.7) = $400 × 1.3175 = $527.00


or about $527 per month. 


3. The CPI is used to adjust other economic data for inflation. If you were to plot 
just about any economic measure over time, you would see an increase simply be- 
cause—at least historically—the value of a dollar decreases every year. To provide 
a true picture of changes in conditions over time, most economic data are presented 
in values adjusted for inflation. You should always check plots of economic data over 
time to see if they have been adjusted for inflation. 


4. The CPI is used to determine salary and price adjustments. According to the Bu- 
reau of Labor Statistics: 


The CPI affects the income of almost 80 million persons, as a result of statu- 
tory action: 48.4 million Social Security beneficiaries, about 19.8 million food 
stamp recipients, and about 4.2 million military and Federal Civil Service re- 
tirees and survivors. Changes in the CPI also affect the cost of lunches for 26.5 
million children who eat lunch at school, while collective bargaining agree- 
ments that tie wages to the CPI cover over 2 million workers. (U.S. Dept. of 
Labor, 2003, CPI Web site) 


Because so many government wage and price increases are tied to the CPI, it has re- 
cently been the subject of scrutiny by Congress. An Advisory Committee to the U.S. 
Senate Committee on Finance (U.S. Senate, 1996) has made numerous recommendations
for changes, some of which are under consideration. One of these changes,
implemented in January 1999, is discussed in the next section. 
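
The comparable-price formula given under the second use can be written as a short Python function. This is only a sketch: the names `CPI` and `comparable_price` are chosen for this illustration. The index values are the ones listed above for the 1982-1984 base period.

```python
# CPI values from the table above (1982-1984 base period).
CPI = {
    1950: 24.1, 1960: 29.6, 1970: 38.8, 1975: 53.8, 1980: 82.4,
    1985: 107.6, 1990: 130.7, 1995: 152.4, 2000: 172.2,
}


def comparable_price(price, year_from, year_to, cpi=CPI):
    """price at time 2 = (price at time 1) * (CPI at time 2) / (CPI at time 1)."""
    return price * cpi[year_to] / cpi[year_from]


# The rent example: $400 a month in 1990, expressed in 2000 dollars.
rent_2000 = comparable_price(400, 1990, 2000)
print(f"${rent_2000:.2f}")
```

The result agrees with the worked example to within a cent. The small difference arises because the hand calculation rounds the ratio of the two CPIs to 1.3175 before multiplying. Both CPIs must use the same base period for the function to make sense.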


14.3 Criticisms of the Consumer Price Index 


Although the Consumer Price Index may be the best measure of inflation available, 
it does have problems. Economists believe that it slightly overestimates increases in 
the cost of living. A study released in October 1994 by the Congressional Budget Of- 
fice estimated that “the CPI was overstating inflation by 0.2 percentage point to 0.8 
percentage point per year” (Associated Press, 20 October 1994). Further, the CPI 
may overstate the effect of inflated prices for the items it covers on the average 
worker’s standard of living. The following criticisms of the CPI should help you un- 
derstand these and other problems with its use. 


Some criticisms of the CPI: 

1. The market basket used in the CPI may not reflect current spending priorities.
2. If the price of one item rises, consumers are likely to substitute another.
3. The CPI does not adjust for changes in quality.
4. The CPI does not take advantage of sale prices.
5. The CPI does not measure prices for rural Americans.


1. The market basket used in the CPI may not reflect current spending priorities. 
Remember that the market basket of goods and the weights assigned to them are 
changed infrequently. As of the end of 2003, the CPI was measuring the cost of items 
typically purchased in 1993-1995. This does not reflect rapid changes in lifestyle 
and technology. For example, cellular phones and CD players have become much 
more common in the past decade. You can probably think of many other changes that 
have occurred since 1995. 


2. If the price of one item rises, consumers are likely to substitute another. If the 
price of beef rises substantially, consumers will buy chicken instead. If the price 
of fresh vegetables goes up due to poor weather conditions, consumers will use 
canned or frozen vegetables until prices go back down. When the price of lettuce 
tripled a few years ago, consumers were likely to buy fresh spinach for salads in- 
stead. In the past, the CPI has not taken these substitutions into account. However, 
starting with the January 1999 CPI, substitutions within a subcategory such as “Ice 
cream and related products” have been taken into account through the use of a new 
statistical method for combining data. It is estimated that this change reduces the annual
rate of increase in the CPI by approximately 0.2 percentage point (U.S. Dept.
of Labor, 16 April 1998).


3. The CPI does not account for changes in quality. The CPI assumes that if you 
purchase the same items in the current year as you did in the base year, your stan- 
dard of living will be the same. That may apply to food and clothing but does not ap- 
ply to many other goods and services. For example, personal computers were not 
only more expensive in 1982-1984, they were also much less powerful. Owning a 
new personal computer now would add more to your standard of living than owning 
one in 1982 would have done. 


4. The CPI does not take advantage of sale prices. The outlets used to measure 
prices for the CPI are chosen by random sampling methods. The outlets consumers 
choose are more likely to be based on the best price that week. Further, if a super- 
market is having a sale on an item you use often, you will probably stock up on the 
item at the sale price and then not need to purchase it for a while. The CPI does not 
take this kind of money-saving behavior into account. 


5. The CPI does not measure prices for rural Americans. As mentioned earlier, the 
CPI is relevant for about 87% of the population: those who live in and around urban 
areas. It does not measure prices for the rural population and we don’t know whether 
it can be extended to that group. The costs of certain goods and services are likely 
to be relatively consistent. However, if the rise in the CPI in a given time period is 
mostly due to rising costs particular to urban dwellers, such as the cost of public 
transportation and apartment rents, then it may not be applicable to rural consumers. 
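
The second criticism mentions a new statistical method for combining data within a subcategory but does not name it. The sketch below assumes the method is a geometric mean of price changes, which is how the change is usually described. Treat that as an assumption of this illustration, not something stated in this chapter. The three price ratios are invented for the example.

```python
import math


def arithmetic_mean(ratios):
    """Ordinary average of price ratios (current price / earlier price)."""
    return sum(ratios) / len(ratios)


def geometric_mean(ratios):
    """Geometric average; high ratios pull it up less than they pull the ordinary average."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical subcategory with three items: one price rose 30%,
# the other two did not change.
ratios = [1.30, 1.00, 1.00]
am = arithmetic_mean(ratios)
gm = geometric_mean(ratios)
print(f"arithmetic: {am:.4f}  geometric: {gm:.4f}")
```

The geometric mean is never larger than the arithmetic mean, so it registers a smaller increase when one item's price jumps. That is consistent with the idea that shoppers shift part of their spending toward the items whose prices did not rise.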


The Bureau of Labor Statistics notes that the CPI is not a cost-of-living index and 
should not be interpreted as such. It is most useful for comparing prices of similar 
products in the same geographic area across time. The BLS routinely studies and im- 
plements changes in methods that lead to improvements in the CPI. Current infor- 
mation about the CPI can be found on the CPI pages of the BLS Web site; that 
address, as of this writing, is www.bls.gov/cpi/. 


14.4 Economic Indicators 


The Consumer Price Index is only one of many economic indicators produced or 
used by the U.S. government. Historically, the Bureau of Economic Analysis (BEA), 
part of the Department of Commerce, classified and monitored a whole host of such 
indicators. Stratford and Stratford (1992, pp. 36-38) put together a table listing 103 
economic indicators accepted by the BEA as of February 1989. In 1995, the Depart- 
ment of Commerce turned over the job of producing and monitoring some of these 
indicators to the not-for-profit, private organization called The Conference Board. 
Most economic indicators are series of data collected across time, like the CPI. 
Some of them measure financial data, others measure production, and yet others 
measure consumer behavior. Here is a list of 10 series, randomly selected by the au- 
thor from the previously mentioned table provided by Stratford and Stratford, to give 
you an idea of the variety. The letters in parentheses will be explained in the following
section.


09 Construction contracts awarded for commercial and industrial buildings, 
floor space (L,C,U) 

10 Contracts and orders for plant and equipment in current dollars (L,L,L) 
14 Current liabilities of business failure (L,L,L) 

25 Changes in manufacturers’ unfilled orders, durable goods industries 
(L,L,L) 

27 Manufacturers’ new orders in 1982 dollars, nondefense capital goods in- 
dustries (L,L,L) 

39 Percent of consumer installment loans delinquent 30 days or over (L,L,L) 
51 Personal income less transfer payments in 1982 dollars (C,C,C) 

84 Capacity utilization rate, manufacturing (L,C,U) 

110 Funds raised by private nonfinancial borrowers in credit markets (L,L,L) 
114 Discount rate on new issues of 91-day Treasury bills (C,LG,LG) 


You can see that even this randomly selected subset covers a wide range of infor- 
mation about government, business and consumer behavior, and economic status. 


Leading, Coincident, and Lagging Indicators 


Most indicators move with the general health of the economy. The Conference 
Board classifies economic indicators according to whether their changes precede, 
coincide with, or lag behind changes in the economy. 

A leading economic indicator is one in which the highs, lows, and changes tend 
to precede or lead similar changes in the economy. (Contrary to what you may have 
thought, the term does not convey that it is one of the most important economic in- 
dicators.) A coincident economic indicator is one with changes that coincide with 
those in the economy. A lagging economic indicator is one whose changes lag be- 
hind or follow changes in the economy. 

To further complicate the situation, some economic indicators have highs that 
precede or lead the highs in the economy but have lows that are coincident with or 
lag behind the lows in the economy. Therefore, the indicators are further classified 
according to how their highs, lows, and changes correspond to similar behavior in 
the economy. The 10 indicators sampled in the previous section are classified
this way. The letters following each indicator show how the highs, lows, and
changes, respectively, are classified for that series. 

The code letters are L = Leading, LG = Lagging, C = Coincident, and U = 
Unclassified. For example, the code letters in indicator 10, “Contracts and orders for 
plant and equipment in current dollars (L,L,L),” show that this indicator leads the 
economy in all respects. When this series remains high, remains low, or changes, the 
economy tends to follow. In contrast, the code letters in indicator 114, “Discount rate 
on new issues of 91-day Treasury bills (C,LG,LG),” show that this indicator has 
highs that are coincident with the economy but has lows and changes that tend to lag
behind the economy.
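
The three-letter codes are easy to decode mechanically. The following Python sketch does so. The names `CODE_MEANINGS` and `decode_timing` are invented for this illustration. The code letters and their order (highs, lows, changes) are the ones described above.

```python
# Code letters as defined in the text.
CODE_MEANINGS = {
    "L": "Leading",
    "LG": "Lagging",
    "C": "Coincident",
    "U": "Unclassified",
}


def decode_timing(codes):
    """Turn a string such as 'C,LG,LG' into labels for highs, lows, and changes."""
    parts = [part.strip() for part in codes.split(",")]
    if len(parts) != 3:
        raise ValueError("expected three codes: highs, lows, changes")
    labels = [CODE_MEANINGS[part] for part in parts]
    return dict(zip(("highs", "lows", "changes"), labels))


# Indicator 114, discount rate on new issues of 91-day Treasury bills.
print(decode_timing("C,LG,LG"))
```

For indicator 10, `decode_timing("L,L,L")` labels all three features as leading, which matches the description of that series.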


Composite Indexes 


Rather than require decision makers to follow all of these series separately, the Con- 
ference Board produces composite indexes. The Index of Leading Economic Indica- 
tors comprises 11 series, listed in Table 14.2. Most, but not all, of the individual 
component series are collected by the U.S. government. For instance, the Index of 
Stock Prices is provided by Standard and Poor’s Corporation, whereas the Index of 
Consumer Expectations is provided by the University of Michigan’s Survey Re- 
search Center. 

The Index of Coincident Economic Indicators comprises four series; the Index of 
Lagging Economic Indicators, seven series. These indexes are produced monthly, 
quarterly, and annually. 

Behavior of the Index of Leading Economic Indicators is thought to precede that 
of the general economy by about 6 to 9 months. This is based on observing past per- 
formance and not on a causal explanation—that is, it may not hold in the future be- 
cause there is no obvious cause and effect relationship. In addition, monthly changes 
can be influenced by external events that may not predict later changes in the econ- 
omy. Nonetheless, the Index of Leading Economic Indicators is followed closely and 
used as a predictor of things to come. A news story hints at how this index influ- 
ences, and is influenced by, events: 


Index of Leading Indicators shows small decline during February. . . . The gov- 
ernment’s chief forecasting gauge of future economic activity suffered its first 
decline in seven months, the Commerce Department reported today. Much of 
the weakness was blamed on severe winter weather. . . . Today’s report pro- 
vided some assurance to jittery investors who have been dumping stocks and 
bonds out of fears that the economy was growing so rapidly it would trigger 
higher inflation. [Davis (CA) Enterprise, 5 April 1994, p. A10] 


Table 14.2 Components of the Index of Leading Economic Indicators 


1. Average workweek of production workers in manufacturing
2. Average weekly initial claims for state unemployment insurance
3. New orders for consumer goods and materials, adjusted for inflation
4. Vendor performance (companies receiving slower deliveries from suppliers)
5. Contracts and orders for plant and equipment, adjusted for inflation
6. New building permits issued [private housing units]
7. Change in manufacturers’ unfilled orders, durable goods
8. Change in sensitive materials prices
9. Index of stock prices
10. Money supply: M-2, adjusted for inflation
11. Index of consumer expectations


Source: World Almanac and Book of Facts, 1993, p. 136. 
Stratford and Stratford (1992, pp. 43-45) discuss other reasons why the Index may
be limited as a predictor of the economic future. For instance, they note that about 
75% of all jobs in the United States are now in the service sector, yet the Index fo- 
cuses on manufacturing. Nevertheless, although the Index may not be ideal, it is still 
the most commonly quoted source of predictions about future economic behavior. 


Did Wages Really Go Up in the Reagan-Bush Years? 


It was the fall of 1992 and the United States presidential election was imminent. The 
Republican incumbent, George Bush (Senior), had been president for the past 4 
years, and vice president to Ronald Reagan for 8 years before that. One of the major
themes of the campaign was the economy. Despite the fact that the federal budget
deficit had grown astronomically during those 12 Reagan-Bush years, the
Republicans argued that Americans were better off in 1992 than they had been 12 
years earlier. One of the measures they used to illustrate their point of view was the 
average earnings of workers. The average wages of workers in private, nonagricul- 
tural production had risen from $235.10 per week in 1980 to $345.35 in 1991 (World 
Almanac and Book of Facts, 1995, p. 150). 

Were those workers really better off because they were earning almost 50% more 
in 1991 than they had been in 1980? Supporters of Democratic challenger Bill Clin- 
ton didn’t think so. They began to counter the argument with some facts of their own. 

Based on the material in this chapter, you can decide for yourself. The Consumer 
Price Index in 1980, measured with the 1982-1984 baseline, was 82.4. For 1991, it
was 136.2. Let’s see what average weekly earnings in 1991 should have been, ad- 
justing for inflation, to have remained constant with the 1980 average: 


salary at time 2 = (salary at time 1) × [(CPI at time 2)/(CPI at time 1)]
salary in 1991 = ($235.10) × [(136.2)/(82.4)] = $388.60


Therefore, the average weekly salary actually dropped during those 11 years, ad- 
justed for inflation, from the equivalent of $388.60 to $345.35. The actual average 
was only 89% of what it should have been to have kept up with inflation. 
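
The same comparison can be checked in a few lines of Python. This is a sketch using only the figures quoted above. The function name `keep_pace` is an invention for this illustration.

```python
def keep_pace(amount, cpi_then, cpi_now):
    """Amount needed now to match the buying power of `amount` back then."""
    return amount * cpi_now / cpi_then


wage_1980, wage_1991 = 235.10, 345.35   # average weekly earnings, in dollars
cpi_1980, cpi_1991 = 82.4, 136.2        # 1982-1984 base period

needed_1991 = keep_pace(wage_1980, cpi_1980, cpi_1991)
ratio = wage_1991 / needed_1991              # actual wage relative to the inflation-adjusted wage
nominal_change = wage_1991 / wage_1980 - 1   # change before adjusting for inflation

print(f"needed in 1991: ${needed_1991:.2f}")
print(f"actual as a share of needed: {ratio:.0%}")
print(f"increase before adjustment: {nominal_change:.0%}")
```

Before adjustment the wage rose by nearly half, yet after adjustment it covers only about 89% of what the 1980 wage could buy. Reporting both figures side by side shows how the two campaigns could draw opposite conclusions from the same data.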

There is another reason why the argument made by the Republicans would sound 
convincing to individual voters. Those voters who had been working in 1980 may 
very well have been better off in 1991, even adjusting for inflation, than they had 
been in 1980. That’s because those workers would have had an additional 11 years 
of seniority in the workforce, during which their relative positions should have im- 
proved. A meaningful comparison, which uses average wages adjusted for inflation, 
would not apply to an individual worker. 


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. The price of a first-class stamp in 1970 was 8 cents, whereas in 2002 it was 37
cents. The Consumer Price Index for 1970 was 38.8, whereas for 2002 it was
172.2. If the true cost of a first-class stamp did not increase between 1970 and
2002, what should it have cost in 2002? In other words, what would an 8-cent
stamp in 1970 cost in 2002, when adjusted for inflation?


*2. The CPIs at the start of each decade from 1940 to 2000 were


Year 1940 1950 1960 1970 1980 1990 2000
CPI 14.0 24.1 29.6 38.8 82.4 130.7 172.2


*a. Determine the percentage increase in the CPI for each decade.

b. During which decade was inflation the highest, as measured by the percentage
change in the CPI?

c. During which decade was inflation the lowest, as measured by the percentage
change in the CPI?


*3. A paperback novel cost $1.50 in 1968, $3.50 in 1981, and $6.99 in 1995. Compute
a “paperback novel price index” for 1981 and 1995 using 1968 as the base
year. In words that can be understood by someone with no training in statistics,
explain what the resulting numbers mean.


4. When the CPI was computed for December 2002, the relative weight for the
food and beverages category was 15.6%, whereas for the recreation category it
was only 5.9%. Explain why food and beverages received higher weight than
recreation.


5. Remember that the CPI is supposed to measure the change in what it costs to
maintain the same standard of living that was in effect during the base year(s).
Using the material in Section 14.3, explain why it may not do so accurately.


6. Americans spent the following amounts for medical care between 1987 and
1993, in billions of dollars (World Almanac and Book of Facts, 1995, p. 128).


Year 1987 1988 1989 1990 1991 1992 1993
Amount spent 399.0 487.7 536.4 597.8 651.7 704.6 760.5


a. Create a “medical care index” for each of these years, using 1987 as a base.

b. Comment on how the cost of medical care has changed between 1987 and
1993, relative to the change in the Consumer Price Index, which was 113.6
in 1987 and 144.5 in 1993.


*7. Suppose that you gave your niece a check for $50 on her 16th birthday in 1997
(when the CPI was 160.5). Your nephew is now about to turn 16. You discover
that the CPI is now 180. How much should you give your nephew if you want
to give him the same amount you gave your niece, adjusted for inflation?


8. In explaining why it is a costly mistake to have the CPI overestimate inflation,
the Associated Press (20 October 1994) reported, “Every 1 percentage point increase
in the CPI raises the federal budget deficit by an estimated $6.5 billion.”
Explain why that would happen.


9. Find out what the tuition and fees were for your school for the previous 4 years
and the current year. Using the cost 5 years ago as the base, create a “tuition index”
for each year since then. Write a short summary of your results that would
be understood by someone who does not know what an index number is.
10. In addition to the overall CPI, the BLS reports the index for the subcategories.
The overall CPI in September 2003 was 185.2. Following are the values for
some of the subcategories, taken from the CPI Web site:


Dairy products 170.3
Fruits, vegetables 222.4
Alcoholic beverages 187.9
Rent 206.6
House furnishings 125.2
Footwear 120.3
Tobacco products 468.7


All of these values are based on using 1982-1984 as the base period, the
same period used by the overall CPI. Write a brief report explaining this information
that could be understood by someone who does not know what index
numbers are.


11. As mentioned in this chapter, both the base year and the relative weights used
for the Consumer Price Index are periodically updated.


a. Why is it important to update the relative weights used for the CPI?
b. Explain why the base year is periodically updated.


*12. Most newspaper accounts of the Consumer Price Index report the percentage
change in the CPI from the previous month rather than the value of the CPI itself.
Why do you think that is the case?


13. The Bureau of Labor Statistics reports that one use of the Consumer Price Index
is to periodically adjust the federal income tax structure, which sets higher tax
rates for higher income brackets. According to the BLS, “these adjustments prevent
inflation-induced increases in tax rates, an effect called ‘bracket creep’”
(U.S. Dept. of Labor, 2003, CPI Web site). Explain what is meant by “bracket
creep” and how you think the CPI is used to prevent it.


14. Many U.S. government payments, such as social security benefits, are increased
each year by the percentage change in the CPI. In 1995, the government started
discussions about lowering these increases or changing the way the CPI is calculated.
According to an article in the New York Times, “most economists who
have studied the issue closely say the current system is too generous to Federal
beneficiaries . . . the pain of lower COLAs [cost of living adjustments] would be
unavoidable but nonetheless appropriate” (Gilpin, 1995, p. D19). Explain in
what sense some economists believe the current system is too generous.


15. One of the components of the Index of Leading Economic Indicators is the Index
of Consumer Expectations. Why do you think this index would be a leading
economic indicator?


16. Examine the 11 series that make up the Index of Leading Economic Indicators,
listed in Table 14.2. Choose at least two of the series to support the explanation
given by the government in March 1994 that the drop in these indicators in February
was partially due to severe winter weather.


*17. Two of the economic indicators measured by the U.S. government are “Number
of employees on nonagricultural payrolls” and “Average duration of unemployment,
in weeks.” One of these is designated as a “lagging economic indicator”
and the other is a “coincident economic indicator.” Explain which you think is
which, and why.


18. An article in the Sacramento Bee (Stafford, 2003) on July 7, 2003 reported that
the current minimum wage is only $5.15 an hour and that it has not kept pace
with inflation. The Consumer Price Index at the time (the end of June 2003) was
183.7.


a. One of the quotes in the article was “to keep pace with inflation since 1968,
the minimum wage should be $8.45 an hour today.” In 1968 the minimum
wage was $1.60 an hour and the Consumer Price Index was 34.8. Explain
how the author of the article determined that the minimum wage should be
$8.45 an hour.


b. The minimum wage was initiated in October 1938 at $0.25 an hour. The
Consumer Price Index in 1938 was 14.1, using the 1982-1984 base years. If
the minimum wage had kept pace with inflation from its origination, what
should it have been at the end of June 2003? Compare your answer to the actual
minimum wage of $5.15 an hour in June 2003.


c. The minimum wage of $5.15 an hour was set in September 1997 when the
Consumer Price Index was 161.2. If it had kept pace with inflation, what
should it have been at the end of June 2003?


d. Based on your answers to parts a to c, has the minimum wage always kept
pace with inflation, never kept pace with inflation, or some combination?


19. Refer to the previous exercise. Find out the current minimum wage and the current
Consumer Price Index. (These were available as of November 2003 at the
Web sites http://www.dol.gov/esa/minwage/chart.htm and http://www.bls.gov/cpi/,
respectively.) Determine what the minimum wage should be at the current
time if it had kept pace with inflation from


a. 1950, when the CPI was 24.1 and the minimum wage was $0.75 an hour.
b. 1960, when the CPI was 29.6 and the minimum wage was $1.00 an hour.
c. 1990, when the CPI was 130.7 and the minimum wage was $3.80 an hour.


20. The United States Census Bureau, Statistical Abstract of the United States 1999,
p. 877, contains a table listing median family income for each year from 1947
to 1997. The incomes are presented “in current dollars” and “in constant (1997)
dollars.” As an example, the median income in 1985 in “current dollars” was
$27,735 and in “constant (1997) dollars” it was $41,371. The CPI in 1985 was
107.6 and in 1997 it was 160.5.

a. Using these figures for 1985 as an illustration, explain what is meant by “in
constant (1997) dollars.”

b. The median family income in 1997 was $44,568. After adjusting for inflation,
compare the 1985 and 1997 median incomes. Report the percent increase
or decrease from 1985 to 1997.

c. Name one advantage to reporting the incomes in “current dollars” and one
advantage to reporting the incomes in “constant dollars.”


21. In 1950, being a millionaire was touted as a goal that would be achievable by
very few people. The CPI in 1950 was 24.1, and in 2002 it was 179.9. How
much money would one need to have in 2002 to be the equivalent of a millionaire
in 1950, adjusted for inflation? Does it still seem like a goal achievable by
very few people in 2002?


Mini-Projects 


1. Numerous economic indicators are compiled and reported by the U.S. government
and by private companies. Various sources are available at the library and
on the Internet to explain these indicators. Write a report on one of the following.
Explain how it is calculated, what its uses are, and what limitations it might
have.


a. The Producer Price Index
b. The Gross Domestic Product
c. The Dow Jones Industrial Average


2. Find a news article that reports on current values for one of the indexes discussed
in this chapter. Discuss the news report in the context of what you have learned
in this chapter. For example, does the report contain any information that might
be misleading to an uneducated reader? Does it omit any information that you
would find useful? Does it provide an accurate picture of the current situation?
Understanding and Reporting Trends over Time


THOUGHT QUESTIONS 


1. What do you think is meant by the term time series? 


2. What do you think it means when a monthly economic indicator, such as new hous- 
ing starts, is reported as having been seasonally adjusted? 


3. If you were to plot number of ice cream cones sold versus month for 5 years, do you 
think the plot would show peaks and valleys, or would sales be relatively constant 
across all months? Explain. 

4. If someone is trying to get you to invest in his or her company and shows you a plot 
of sales or profits over time, what features of the picture do you think you should 
critically evaluate before you decide to invest? 
15.1 Time Series


Figure 15.1 An example of a time series plot: Jeans sales in the United Kingdom from 1980 to 1984. Source: Hand et al., 1994, p. 314.


We have already seen examples of time series in Chapter 14, although we did not 
call them by that name. A time series is simply a record of a variable across time, 
usually measured at equally spaced time intervals. Most of the economic indicators 
discussed in Chapter 14 are time series that are measured monthly. 

To understand data presented across time, it is important to know how to recog- 
nize the various components that can contribute to the ups and downs in a time se- 
ries. Otherwise, you could mistake a temporary high in a cycle for a permanent 
increasing trend and make a very unwise economic decision. 


A Time Series Plot 


Figure 15.1 illustrates a time series plot. The data represent monthly sales of jeans 
in Britain for the 5-year period from January 1980 to December 1984. Notice that 
the data points have been connected to make it easier to follow the ups and downs 
across time. Data are measured in thousands of pairs sold. Month 1 is January 1980, 
and month 60 is December 1984. 


Improper Presentation of a Time Series 


Before we investigate the components in Figure 15.1 (and other time series), let’s 
look at one way in which you can be fooled by improper presentation of a time se- 
ries. In Figure 15.2, a subset of the time series is displayed. 
Figure 15.2 Distortion caused by displaying only part of a time series: Jeans sales for 21 months

Suppose an unscrupulous entrepreneur was anxious to have you invest your hard-earned
savings in his blue jeans company. To convince you that sales of jeans can
only go up, he presents you with a limited set of data—from October 1982 to June 
1984. With only those few months shown, it appears that the basic trend is way up! 

A less obvious version of this trick is to present data up to the present time but 
to start the plot of the series at an advantageous point. Be suspicious of time series 
showing returns on investments that look too good to be true. They probably are. No- 
tice when the time series begins and compare that with your knowledge of recent 
economic cycles. 


15.2 Components of Time Series 


Most time series have the same four basic components: long-term trend, seasonal 
components, irregular cycles, and random fluctuations. Let’s examine each of these 
in turn. 


Long-Term Trend 


Many time series measure variables that either increase or decrease steadily across 
time. This steady change is called a trend. If the trend is even moderately large, it 
should be obvious by looking at a plot of the series. Figure 15.1 clearly shows an in- 
creasing trend for jeans sales. 

If the long-term trend is linear, we can estimate it by finding a regression line,
with time period as the explanatory variable and the variable in the time series as the
response variable. We can then remove the trend to enable us to see what other
interesting features exist in the series. When we do that, the result is, aptly enough,
called a detrended time series.

Figure 15.3 An example of a detrended time series: Jeans sales with trend removed

The regression line for the data in Figure 15.1 is 


sales = 1880 + 6.62 (months) 


Notice that month 1 is January 1980 and month 60 is December 1984. If we were to 
try to forecast sales for January 1985, the first month that is not included in the se- 
ries, we would use months = 61. The resulting value is 2284 thousand pairs of jeans. 
Actual sales were 2137 thousand pairs. Our prediction is not far off, given that, over- 
all, the data range from about 1600 to 3100 thousand pairs. Notice one reason why 
the actual value may be slightly lower than the predicted value; sales tend to be lower 
during the winter months. We look at that seasonal component next. 
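
The forecast described above is a single evaluation of the regression line. As a minimal sketch in Python (the function name `predict_sales` is chosen here for illustration and does not come from the text), the calculation can be written as:

```python
# Trend line from the text: sales = 1880 + 6.62 (months),
# with sales in thousands of pairs and month 1 = January 1980.
INTERCEPT = 1880.0
SLOPE = 6.62


def predict_sales(month):
    """Return the trend-line forecast (thousands of pairs) for a month number."""
    return INTERCEPT + SLOPE * month


forecast = predict_sales(61)     # January 1985 is month 61
actual = 2137                    # actual January 1985 sales, from the text
print(round(forecast))           # 2284
print(round(actual - forecast))  # -147: actual sales fell below the line
```

The negative difference is the amount by which the trend line alone overshoots a winter month.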

The regression line indicates that the trend, on average, shows sales increasing 
by about 6.62 units per month. Because the units represent thousands of pairs, the 
actual increase is about 6620 pairs per month. Figure 15.3 presents the time series 
for jeans sales with the trend removed. Compare Figure 15.3 with Figure 15.1. No- 
tice that the fluctuations remaining in Figure 15.3 are similar in character to those in 
Figure 15.1, but the upward trend is gone. 
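
Removing the trend amounts to subtracting the value of the regression line from each observation. The following is a minimal sketch that reuses the line quoted in the text; the function name `detrend` and the three sample values are made up for illustration and are not the actual jeans data.

```python
def detrend(values, first_month=1, intercept=1880.0, slope=6.62):
    """Subtract the trend line from each value.

    values[0] is the observation for month `first_month`; the defaults are
    the regression line quoted in the text (thousands of pairs).
    """
    return [
        value - (intercept + slope * (first_month + i))
        for i, value in enumerate(values)
    ]


# Hypothetical values, NOT the real series: the first two sit exactly on
# the trend line, the third is 100 above it.
sample = [1886.62, 1893.24, 1999.86]
residuals = detrend(sample)
print([round(abs(r), 2) for r in residuals])
```

What remains after the subtraction is the part of each observation that the trend does not explain, which is what a detrended plot displays.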

Let’s look at what we would have estimated as the trend if we had been fooled 
by the picture in Figure 15.2. We would have predicted a much higher increase per 
month. The regression line for the data in Figure 15.2 is 


sales = 1832 + 32.1 (months) 
In other words, the trend is estimated to show an increase of 32,100 pairs a month,
compared with 6620 pairs computed from the full time series. 
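
To see how a short window can exaggerate a trend, one can fit a least-squares line to a full series and to a slice of it. The sketch below uses a synthetic series (the trend line from the text plus an invented 12-month sine wave), not the actual jeans data; the helper `ls_slope` and the seasonal amplitude of 300 are assumptions made purely for illustration.

```python
import math


def ls_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic stand-in for 60 months of sales: linear trend plus a yearly cycle.
months = list(range(1, 61))
sales = [1880 + 6.62 * m + 300 * math.sin(2 * math.pi * m / 12) for m in months]

full_slope = ls_slope(months, sales)

# Months 34 to 54 correspond to October 1982 through June 1984 (21 months).
window = slice(33, 54)
window_slope = ls_slope(months[window], sales[window])

print(round(full_slope, 2), round(window_slope, 2))
```

With this synthetic series the 21-month slice gives a steeper fitted line than the full 60 months, even though the underlying trend is identical in both.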


Seasonal Components 


Most time series involving economic data or data related to people’s behavior also 
have seasonal components. In other words, they tend to be high in certain months 
or seasons and low in others every year. For example, new housing starts are much 
higher in warmer months. Sales of toys and other standard gifts are much higher just 
before Christmas. U.S. unemployment rates tend to rise in January, when outdoor 
jobs are minimal and the Christmas season is over, and again in June, when a new 
graduating class enters the job market. 

Most of the economic indicators discussed in Chapter 14 are subject to seasonal 
fluctuations. As we shall see in Section 15.3, they are usually reported after they 
have been seasonally adjusted. 

Notice that there is indeed a seasonal component to the time series of sales of 
jeans. It is evident in both Figure 15.1 and Figure 15.3. Sales appear to peak during 
June and July and reach a low in October every year. Manufacturers need to know 
that information. Otherwise, they might mistake increased sales during June, for ex- 
ample, as a general trend and overproduce their product. 

Economists have sophisticated methods for seasonally adjusting time series. 
They use data from the same month or season in prior years to construct a seasonal 
factor, which is a number either greater than one or less than one by which the cur- 
rent figure is multiplied. According to the U.S. Department of Labor (1992, p. 243), 
“the standard practice at BLS for current seasonal adjustment of data, as it is initially 
released, is to use projected seasonal factors which are published ahead of time.” In 
other words, when figures such as the Consumer Price Index become available for a 
given month, the BLS already knows the amount by which the figures should be ad- 
justed up or down to account for the seasonal component. 
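
The idea of a multiplicative seasonal factor can be sketched in a few lines. This is a deliberately simplified illustration, not the BLS procedure: the function `seasonal_factors`, the averaging rule it uses, and all of the numbers are assumptions made here for the example.

```python
def seasonal_factors(history):
    """Compute one multiplicative factor per calendar month.

    history is a list of prior years, each a list of 12 monthly values.
    A factor below 1 scales down a month that usually runs high; a factor
    above 1 scales up a month that usually runs low.
    """
    n_years = len(history)
    overall_mean = sum(sum(year) for year in history) / (12 * n_years)
    factors = []
    for m in range(12):
        month_mean = sum(year[m] for year in history) / n_years
        factors.append(overall_mean / month_mean)
    return factors


# Hypothetical prior years: January runs low (80), June runs high (120).
typical_year = [80, 100, 100, 100, 100, 120, 100, 100, 100, 100, 100, 100]
factors = seasonal_factors([typical_year, typical_year])

june_raw = 126                         # hypothetical new June figure
june_adjusted = june_raw * factors[5]  # multiply by the June factor
print(round(factors[0], 2), round(factors[5], 2), round(june_adjusted, 1))
```

Because the factors are computed from earlier years only, they can be known before the current month's figure arrives, which is the point made in the quotation above.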


Irregular Cycles and Random Fluctuations 


There are two remaining components of time series: the irregular (but smooth) cy- 
cles that economic systems tend to follow and unexplainable random fluctuations. It 
is often hard to distinguish between these two components, especially if the cycles 
are not regular. 

Figure 15.4 shows the U.S. unemployment rate, seasonally adjusted, for each 
January from 1950 to 1982. Notice the definite irregular cycles during which un- 
employment rates rise and fall over a number of years. Some of these can be at least 
partially explained by social and political factors. For example, the Vietnam War era 
spanned the years from the mid-1960s to the early 1970s. The mandatory draft ended 
in 1973, freeing many young men to enter the job market. You can see a decreasing 
cycle during the Vietnam War years that ends in 1972. 

The random fluctuations in a time series are defined as what’s left over when 
the other three components have been removed. They are part of the natural vari- 
ability present in all measurements. Notice from Figures 15.1 and 15.3 that even if 


Figure 15.4 

An example of a time 
series with irregular cy- 
cles: Adjusted January 
unemployment rates, 
1950-1982 


Source: Based on data from 
Miller, 1988, p. 284. 
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you were to account for the trend, seasonal components, and smooth irregular cy- 
cles, you would still not be able to perfectly explain the jeans sales each month. The 
remaining, unexplainable components are labeled random fluctuations. 


15.3 Seasonal Adjustments: 
Reporting the Consumer Price Index 


It is unusual to see the Consumer Price Index itself reported in the news. More com- 
monly, what you see reported is the change from the previous month, which is gen- 
erally reported in the middle of each month. Following is an example of how it was 
reported in the New York Times: 


Consumer Prices Rose 0.3% in June Washington, July 13—Consumer prices 
climbed three-tenths of 1 percent in June, as increases for cars, gasoline, air 
fares and clothing more than offset moderation in housing, the Labor Depart- 
ment reported today. (Hershey, 19 July 1994, p. C1) 


Most news reports never tell you the actual value of the CPI. In this report, the CPI 
itself was finally given at the end of a long article, in the following paragraph: 


The index now stands at 148.0, meaning that an array of goods and services
that cost $10 in the 1982-84 reference period now costs $14.80. The value of
the 1982-84 dollar is now 67.6 cents. (Hershey, 19 July 1994, p. C17)
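
Both figures in the quoted paragraph follow from the index value by simple arithmetic. The sketch below assumes, as the quotation implies, that the index equals 100 in the reference period; the function names are chosen here for illustration.

```python
def cost_now(base_period_cost, cpi):
    """Cost today of goods that cost `base_period_cost` in the reference period."""
    return base_period_cost * cpi / 100


def base_dollar_value_cents(cpi):
    """Purchasing power today, in cents, of one reference-period dollar."""
    return 100 * 100 / cpi


cpi = 148.0
print(round(cost_now(10, cpi), 2))             # 14.8
print(round(base_dollar_value_cents(cpi), 1))  # 67.6
```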
One piece of information is blatantly missing from the article itself. You find it
only when you read the accompanying graph, which shows the change in the CPI for 
the previous 12 months. The heading on the graph reads: “Consumer prices—percent 
change, month to month, seasonally adjusted” (italics added). In other words, the 
change of 0.3% does not represent an absolute change in the Consumer Price Index; 
rather it represents a change after seasonal adjustments have been made. Adjustments 
have already been made for the fact that certain items are expected to cost more dur- 
ing certain months of the year. According to the BLS Handbook of Methods: 


An economic time series may be affected by regular intrayearly (seasonal) 
movements which result from climatic conditions, model changeovers, vaca- 
tion practices, holidays, and similar factors. Often such effects are large 
enough to mask the short-term, underlying movement of the series. If the ef- 
fect of such intrayearly repetitive movements can be isolated and removed, 
the evaluation of a series may be made more perceptive. (U.S. Dept. of La- 
bor, 1992, p. 243) 


The BLS Handbook is thus recognizing what you should admit as common sense. It 
is important that economic indicators be reported with seasonal adjustments; other- 
wise, it would be impossible to determine the direction of the real trend. For exam- 
ple, it would probably always appear as if new housing starts dipped in February and 
jumped in May. Therefore, it is prudent reporting to include a seasonal adjustment. 


Why Are Changes in the CPI Big News? 


You may wonder why changes in the CPI are reported as big news. The reason
is that financial markets are extremely sensitive to changes in the rate of inflation. 
The same New York Times article quoted earlier reported: 


Unlike Tuesday’s surprisingly favorable report that prices at the producer level 
were unchanged last month, the C.P.I. data provided little comfort to the ma- 
jority of analysts, who say that inflation—higher in June than in either April or 
May—has begun a gradual upswing, and that the Federal Reserve will need to 
raise short-term interest rates again by mid-August. (Hershey, 19 July 1994, p. C1)


As with anything else in the world, it is the changes that attract concern and attention,
not the continuation of the status quo.


15.4 Cautions and Checklist 


Some time series data are not adjusted for inflation or for seasonal components. 
When you read a report of an economic indicator, you should check to see if it has 
been adjusted for inflation and/or seasonally adjusted. You can then compensate for 
those factors in your interpretation. 


EXAMPLE 1 


CASE STUDY 15.1 
The Dow Jones Industrial Average


The Dow Jones Industrial Average (DJIA) is a weighted average of the price of 30 major 
stocks on the New York Stock Exchange. It reached an all-time high of $11,722.98 on 
January 14, 2000. In fact, it reaches an all-time high almost every year. But the DJIA is 
not adjusted for inflation; it is simply reported in current dollars. Thus, to compare the 
high in one year with that in another, we need to adjust it using the CPI. 

For example, in 1970, the high for the DJIA was $842.00, occurring on December 
29. The high in 1993 was $3794.33, also occurring on December 29. The CPI in 1970 
was 38.8, whereas in 1993 it was 144.5. Did the DJIA rise faster than inflation? To de- 
termine if it did, let's calculate what the 1970 high would have been in 1993 dollars: 


value in 1993 = (value in 1970) × [(CPI in 1993)/(CPI in 1970)]
              = ($842.00) × (144.5/38.8) = $3135.80

Therefore, the high of $3794.33 cannot be completely explained by inflation. If we take
the ratio of the two numbers, we find $3794.33/$3135.80 = 1.21. In other words, the
increase in the DJIA highs from 1970 to 1993 is 21% after adjusting for inflation using
the Consumer Price Index.
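
The same adjustment can be applied to any pair of years once the two CPI values are known. A minimal sketch of the calculation in this example follows; the function name `adjust_for_inflation` is an illustrative choice, not something defined in the text.

```python
def adjust_for_inflation(value, cpi_then, cpi_now):
    """Express an earlier dollar amount in later dollars using the CPI ratio."""
    return value * cpi_now / cpi_then


# Figures from the example: DJIA highs and CPI values for 1970 and 1993.
high_1970_in_1993_dollars = adjust_for_inflation(842.00, 38.8, 144.5)
ratio = 3794.33 / high_1970_in_1993_dollars

print(round(high_1970_in_1993_dollars, 2))  # 3135.8
print(round(ratio, 2))                      # 1.21
```

A ratio above 1 means the later high exceeds what inflation alone would account for; here it does so by about 21%.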


Checklist for Reading Time Series Data 


When you see a plot of a variable across time or when you read about a monthly 
change in a time series, keep in mind that you could be misled in various ways. 


You should ask the following questions when reading time series data: 


1. Are the time periods equally spaced?
2. Is the series adjusted for inflation?
3. Are the values seasonally adjusted?
4. Does the series cover enough of a time span to represent typical long-term
   behavior?
5. Is there an upward or downward trend?
6. Are there other seasonal components that have not been removed?
7. Are there smooth cycles?


Based on your answers to those questions, you may need to calculate or approximate 
an adjustment to the data reported. 


If You’re Looking for a Job, Try May and October 
Source: Miller (1988), pp. 283-285. 


How much do you think unemployment rates fluctuate from season to season? Do
you think they reach a yearly high and low during the same month each year? In
Figure 15.4, we saw that unemployment rates tend to follow cycles over time. In
this case study, we look at a 5-year period only, to see the effect of monthly
components.

Figure 15.5 Unemployment rates for 1977-1981 before being seasonally adjusted
Source: Miller, 1988.

Figure 15.5 shows the monthly unemployment rates from January 1977 to De- 
cember 1981. Each month has been coded with a letter: A = January, B = February, 
and so on. Notice that there are definite monthly components. For each of the 5 
years, a sharp increase occurs between December (L) and January (A) and another 
between May (E) and June (F). The yearly lows occur in the spring, particularly in
May, and in the fall.

Figure 15.6 shows the official, seasonally adjusted unemployment rates for the 
same time period. Notice that the extremes have been removed by the process of sea- 
sonal adjustment. One month no longer dominates each year as the high or the low. 
In fact, the series in Figure 15.6 shows much less variability than the one in Figure 
15.5. Much of the variability in Figure 15.5 was not due to random fluctuations but 
to explainable monthly components. The remaining variability apparent in Figure 
15.6, after adjusting for the obvious trend, can be attributed to random monthly 
fluctuations. 

As a final note, compare Figure 15.6 with Figure 15.4, in which January unem- 
ployment rates from 1950 to 1982 were presented. Notice that the downward trend 
in Figure 15.6 is simply part of the longer cyclical behavior of unemployment rates. 
It would not be projected to continue indefinitely. In fact, by 1983, the yearly un- 
employment rate had risen to 9.7%, as the cyclical behavior evident in Figure 15.4 
continued. This example illustrates that what appears as a trend in a short time se- 
ries may actually be part of a cycle in a longer time series. For that reason, it is not 
wise to forecast a trend very far into the future. 


Figure 15.6 Unemployment rates for 1977-1981 after being seasonally adjusted
Source: Miller, 1988.


Exercises


Asterisked (*) exercises are included in the Solutions at the back of the book. 


*1. For each of the following time series, do you think the long-term trend would
be positive, negative, or nonexistent?

*a. The cost of a loaf of bread measured monthly from 1960 to 2004.

*b. The temperature in Boston measured at noon on the first day of each month
from 1960 to 2004.

c. The price of a basic computer, adjusted for inflation, measured monthly from
1970 to 2004.

d. The number of personal computers sold in the United States measured
monthly from 1970 to 2004.


2. For each of the time series in Exercise 1, explain whether there is likely to be a
seasonal component.

3. Global warming is a major concern because it implies that temperatures around
the world are going up on a permanent basis. Suppose you were to examine a
plot of monthly temperatures in one location for the past 50 years. Explain the
role that the three time series components (trend, seasonal, cycles) would play
in trying to determine whether global warming was taking place.

*4. If you were to present a time series of the yearly cost of tuition at your local
college for the past 30 years, would it be better to first adjust the costs for inflation?
Explain.




5. If you wanted to present a time series of the yearly cost of tuition at your local
college for the past 30 years, adjusted for inflation, how would you do the
adjustment?

6. The population of the United States rose from about 179 million people in 1960
to about 281 million people in 2000. Suppose you wanted to examine a time
series to see if homicides had become an increasing problem over that time period.
Would you simply plot the number of homicides versus time, or is there a
better measure to plot against time? Explain.

7. Many statistics related to births, deaths, divorces, and so on across time are
reported as rates per 100,000 of population rather than as actual numbers. Explain
why those rates may be more meaningful as a measure of change across time
than the actual numbers of those events.

*8. Suppose a time series across 60 months has a long-term positive trend. Would
you expect to find a correlation between the values in the series and the months
1 to 60? If so, can you tell from the information given whether it would be
positive or negative?

9. Explain which one of the components of an economic time series would be most
likely to be influenced by a major war. (See Section 15.2.)

*10. Discuss which of the three components of a time series (trend, seasonal, and
cycles) are likely to be present in each of the following series, reported monthly
for the past 10 years:

*a. Unemployment rates

*b. Hours per day the average child spends watching television

*c. Interest rates paid on a savings account at a local bank

11. Which of the three nonrandom components of time series (trend, seasonal, or
cycles) is likely to contribute the most to the unadjusted Consumer Price Index?
Explain.

12. Draw an example of a time series that has

a. Trend, cycles, and random fluctuations, but not seasonal components.

b. Seasonal components and random fluctuations, but not trend or cycles.

13. Explain why it is important for economic time series to be seasonally adjusted
before they are reported.

14. Suppose you have been hired as a salesperson, selling computers and software.
In January, after 6 months on the job, your sales suddenly plummet. They had
been high from August to December. Your boss, who is also new to the position,
chastises you for this drop. What would you say to your boss to protect your
job?

*15. The CPI in July 1977 was 60.9; in July 1994, it was 148.4.

*a. The salary of the governor of California in July 1977 was $49,100; in July
1994, it was $120,000. Compute what the July 1977 salary would be in
July 1994, adjusted for inflation, and compare it with the actual salary in
July 1994.

*b. The salary of the president of the United States in July 1977 was $200,000.
In July 1994, it was still $200,000. Compute what the July 1977 salary would
be in July 1994, adjusted for inflation, and compare it with the actual salary.

16. The Dow Jones Industrial Average reached a high of $7801.63 on December 29,
1997. Recall from Section 15.4 that it reached a high of $842.00 on December
29, 1970. The Consumer Price Index averaged 38.8 for 1970; for 1997, it
averaged 160.5. By what percentage did the high in the DJIA increase from
December 29, 1970, to December 29, 1997, after adjusting for inflation?

17. Explain why it is important to examine a time series for many years before
making conclusions about the contribution of each of the three nonrandom
components.

18. According to the World Almanac and Book of Facts (1995, p. 380), the
population of Austin, Texas (reported in thousands), has grown as follows:


Year          1950   1960   1970   1980   1990
Population   132.5  186.5  253.5  345.5  465.6


a. Of the three nonrandom components of time series (trends, seasonal, and cy- 
cles), which do you think would be most likely to explain the data if you 
were to see the population of Austin, Texas, by month, from 1950 to 1990? 
Explain. 

b. The regression equation relating the last two digits of each year (50, 60, and 
so on) to the population for Austin, Texas, is 


population = -301 + 8.25 (year)


Use this equation to predict the population of Austin for the year 2000. 

c. Discuss the method you used for the prediction in part b. Draw a line graph 
showing year on the horizontal axis and population on the vertical axis. Does 
it look like a straight line describes the trend well? 

d. The population in 2000 (in thousands) was 656.6. (Source: U.S. Census Bu- 
reau Web site.) Was your prediction in part b very accurate? Explain why or 
why not. 


Mini-Projects 


1. Plot your own resting pulse rate taken at regular intervals for 5 days. Comment
on which of the components of time series are present in your plot. Discuss what
you have learned about your own pulse from this exercise.

2. Find an example of a time series plot presented in a newspaper, magazine,
journal, or Web site. Discuss the plot based on the information given in this chapter.
Comment on what you can learn from the plot.

3. In addition to the Dow Jones Industrial Average, there are other indicators of
fluctuation in stock prices. Two examples are the New York Stock Exchange
Composite Index and the Standard and Poor’s 500. Choose a stock index (other
than the Dow Jones) and write a report about it. Include whether it is adjusted
for inflation, seasonally adjusted, or both. Give information about its recent
performance, and compare it with performance a few decades ago. Make a
conclusion about whether the stock market has gone up or down in that time period,
based on the index you are using, adjusted for inflation.
Understanding
Uncertainty in Life 


In Parts 1 and 2 of this book, you learned how data should be collected and 
summarized. Some simple ideas about chance were introduced in the context 
of whether chance could be ruled out as an explanation for a relationship ob- 
served in a sample. 

The purpose of the material in Part 3 is to acquaint you with some simple 
ideas about probability in ways that can be applied to your daily life. In Chap- 
ter 16, you will learn how to determine and interpret probabilities for simple 
events. You will also see that it is sometimes possible to make long-term pre- 
dictions, even when specific events can't be predicted well. In Chapters 17 and 
18, you will learn how psychological factors can influence judgments involving 
uncertainty. As a consequence, you will learn some hints that will help you
make better decisions in your own life.
Understanding Probability
and Long-Term Expectations 


Thought Questions 


1. Here are two very different queries about probability:

a. If you flip a coin and do it fairly, what is the probability that it will land heads up?

b. What is the probability that you will eventually own a home; that is, how likely do
you think it is? (If you already own a home, what is the probability that you will
own a different home within the next 5 years?)

For which question was it easier to provide a precise answer? Why?

2. Explain what it means for someone to say that the probability of his or her
eventually owning a home is 70%.

3. Explain what's wrong with the following statement, given by a student as a partial
answer to Thought Question 1b: “The probability that I will eventually own a home,
or of any other particular event happening, is 1/2 because either it will happen or it
won't.”

4. Why do you think insurance companies charge young men more than they do older
men for automobile insurance, but charge older men more for life insurance?

5. How much would you be willing to pay for a ticket to a contest in which there was
a 1% chance that you would win $500 and a 99% chance that you would win
nothing? Explain your answer.
16.1 Probability


The word probability is so common that in all probability you will run across it to- 
day in everyday language. But we rarely stop to think about what the word means. 
For instance, when we speak of the probability of winning a lottery based on buying 
a single ticket, are we using the word in the same way as when we speak of the prob- 
ability that we will eventually buy a home? In the first case, we can quantify the 
chances exactly. In the second case, we are basing our assessment on personal be- 
liefs about how life will evolve for us. The conceptual difference illustrated by these 
two examples leads to two distinct interpretations of what is meant by the term 
probability. 


16.2 The Relative-Frequency Interpretation 


The relative-frequency interpretation of probability applies to situations in which 
we can envision observing results over and over again. For example, it is easy to en- 
vision flipping a coin over and over again and observing whether it lands heads or 
tails. It then makes sense to discuss the probability that the coin lands heads up. It is 
simply the relative frequency, over the long run, with which the coin lands heads up. 

Here are some more interesting situations to which this interpretation of proba- 
bility can be applied: 


■ Buying a weekly lottery ticket and observing whether it is a winner.

■ Commuting to work daily and observing whether a certain traffic signal is red
when we encounter it.

■ Testing individuals in a population and observing whether they carry a gene for a
certain disease.

■ Observing births and noting if the baby is male or female.


The Idea of Long-Run Relative Frequency 


If we have a situation such as those just described, we can define the probability of 
any specific outcome as the proportion of time it occurs over the long run. This is 
also called the relative frequency of that particular outcome. 

Notice the emphasis on what happens in the long run. We cannot assess the prob- 
ability of a particular outcome by observing it only a few times. For example, con- 
sider a family with five children, in which only one child is a boy. We would not take 
that as evidence that the probability of having a boy is only 1/5. However, if we no- 
ticed that out of thousands of births only one in five of the babies were boys, then it 
would be reasonable to conclude that the probability of having a boy is only 1/5. 
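
The contrast between a handful of observations and a very long run can be explored by simulation. The sketch below is only an illustration: it assumes each birth is independently a boy with probability .512, and the function `running_proportions`, the checkpoints, and the fixed seed are choices made here, not part of the text.

```python
import random


def running_proportions(p_boy, checkpoints, seed=0):
    """Simulate births one at a time and report the proportion of boys so far."""
    rng = random.Random(seed)
    boys = 0
    results = {}
    for n in range(1, max(checkpoints) + 1):
        if rng.random() < p_boy:
            boys += 1
        if n in checkpoints:
            results[n] = boys / n
    return results


props = running_proportions(0.512, checkpoints={5, 50, 500, 5000, 50000})
for n in sorted(props):
    print(n, round(props[n], 3))
```

The early proportions can land far from .512, while the proportion after tens of thousands of simulated births should sit close to it.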

According to the Information Please Almanac (1991, p. 815), the long-run relative
frequency of males born in the United States is about .512. In other words, over
the long run, 512 male babies are born to every 488 female babies. Suppose we were
to record births in a certain city for the next year. Table 16.1 shows what we might
observe. Notice how the proportion, or relative frequency, of male births jumps
around at first but starts to settle down to something just above .51 in the long run.
If we had tried to determine the true proportion after just 1 week, we would have
been seriously misled.

Table 16.1 Relative Frequency of Male Births

Weeks of Watching      1      4     12     24     36     52
Number of boys        12     47    160    310    450    618
Number of babies      30    100    300    590    880   1200
Proportion of boys  .400   .470   .533   .525   .511   .515
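
The proportions in Table 16.1 are simply the cumulative number of boys divided by the cumulative number of babies. A short sketch that recomputes them from the counts in the table (the variable names are illustrative):

```python
# Cumulative counts from Table 16.1.
weeks = [1, 4, 12, 24, 36, 52]
boys = [12, 47, 160, 310, 450, 618]
babies = [30, 100, 300, 590, 880, 1200]

proportions = [round(b / n, 3) for b, n in zip(boys, babies)]
for w, p in zip(weeks, proportions):
    print(f"after {w:2d} weeks: {p:.3f}")
```

The first value, .400, is far from the later ones, which is why one week of data would have been misleading.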


Determining the Probability of an Outcome 
Method 1: Make an Assumption about the Physical World 


Two methods for determining the probability of a particular outcome fit the relative- 
frequency interpretation. The first method is to make an assumption about the phys- 
ical world and use it to determine the probability of an outcome. For example, we 
generally assume that coins are manufactured in such a way that they are equally 
likely to land with heads up or tails up when flipped. Therefore, we conclude that the 
probability of a flipped coin showing heads up is 1/2. (This probability is based on 
the assumption that the physics of the situation allows the coin to flip around enough 
to become unpredictable. With practice, you can learn to toss a coin to come out the 
way you would like more often than not.) 

As a second example, we can determine the probability of winning the lottery by 
assuming that the physical mechanism used to draw the winning numbers gives each 
number an equal chance. For instance, many state-run lotteries in the United States 
have participants choose three digits, each from the set 0 to 9. If the winning set is 
drawn fairly, each of the 1000 possible combinations should be equally likely. (The 
1000 possibilities are 000, 001, 002, . . . , 999.) Therefore, each time you play your 
probability of winning is 1/1000. You win only on those rare occasions when the set 
of numbers you chose is actually drawn. In the long run, that should happen about 1 
out of 1000 times. Notice that this does not mean it will happen exactly once in every 
thousand draws. 
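
The count of 1000 possibilities, and the remark that a win need not occur exactly once per thousand draws, can both be checked with a few lines of code. This sketch assumes the draws are fair and independent of one another; the variable names are chosen for illustration.

```python
from itertools import product

# Every way to choose three digits, each from 0 to 9: 000, 001, ..., 999.
combinations = list(product(range(10), repeat=3))
p_win = 1 / len(combinations)

# Chance of never winning in 1000 independent plays.
p_no_win_in_1000 = (1 - p_win) ** 1000

print(len(combinations))           # 1000
print(round(p_no_win_in_1000, 3))  # 0.368
```

Under these assumptions, a player has roughly a 37% chance of going through 1000 plays without a single win, even though the long-run winning frequency is 1 in 1000.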


Method 2: Observe the Relative Frequency 

The other way to determine the probability of a particular outcome is by observing 
the relative frequency over many, many repetitions of the situation. We used that 
method when we observed the relative frequency of male births in a given city over 
the course of a year. By using this method, we can get a very accurate figure for the 
probability that a birth will be a male. As mentioned, the relative frequency of male 
births in the United States has been consistently close to .512 (Information Please 
Almanac, 1991, p. 815). For example, in 1987 there were a total of 3,809,394 live 
births in the United States, of which 1,951,153 were males. Therefore, in 1987 the
probability that a live birth would result in a male was 1,951,153/3,809,394 = .5122. 

Sometimes relative-frequency probabilities are reported on the basis of sample 
surveys. In such cases, a margin of error should be included but often is not. For ex- 
ample, the World Almanac and Book of Facts (1993, p. 38), reported that “on any 
given day, 71 percent of Americans read a newspaper . . . according to a 1991 Gallup 
Poll.” 


Summary of the Relative-Frequency 
Interpretation of Probability 


■ The relative-frequency interpretation of probability can be applied when a
situation can be repeated numerous times, at least conceptually, and the
outcome can be observed each time.

■ In scenarios for which this interpretation applies, the relative frequency
with which a particular outcome occurs should settle down to a constant
value over the long run. That value is then defined to be the probability of
that outcome.

■ The interpretation does not apply to situations where the outcome one
time is influenced by or influences the outcome the next time because the
probability would not remain the same from one time to the next. We cannot
determine a number that is always changing.

■ Probability cannot be used to determine whether the outcome will occur
on a single occasion but can be used to predict the long-term proportion of
the times the outcome will occur.


Relative-frequency probabilities are quite useful in making daily decisions. For 
example, suppose you have a choice between two flights to reach your destination. 
All other factors are equivalent, but your travel agent tells you that one has a proba- 
bility of .90 of being on time, whereas the other has only a probability of .70 of be- 
ing on time. Even though you can’t predict the outcome for your particular flight, you 
would be likely to choose the one that has the better performance in the long run. 


16.3 The Personal-Probability Interpretation 


The relative-frequency interpretation of probability is clearly limited to repeatable 
conditions. Yet, uncertainty is a characteristic of most events, whether they are re- 
peatable under similar conditions or not. We need an interpretation of probability 
that can be applied to situations even if they will never happen again. 
Will you fare better by taking calculus than by taking statistics? If you decide to
drive downtown this Saturday afternoon, will you be able to find a good parking 
space? Should a movie studio release a potential new hit movie before Christmas, 
when many others are released, or wait until January, when it might have a better 
chance of being the top box-office attraction? Would a trade alliance with a new 
country cause problems in relations with a third country? 

These are unique situations, not likely to be repeated. They require people to 
make decisions based on an assessment of how the future will evolve. We could each 
assign a personal probability to these events, based on our own knowledge and ex- 
periences, and we could use that probability to help us with our decisions. We may 
not agree on what the probabilities of differing outcomes are, but none of us would 
be considered wrong. 


Defining Personal Probability 


We define the personal probability of an event to be the degree to which a given in- 
dividual believes the event will happen. There are very few restrictions on personal 
probabilities. They must fall between 0 and 1 (or, if expressed as a percentage, be- 
tween 0 and 100%). They must also fit together in certain ways if they are to be co- 
herent. By being coherent, we mean that your personal probability of one event 
doesn’t contradict your personal probability of another. For example, if you thought 
that the probability of finding a parking space downtown Saturday afternoon was 
.20, then to be coherent, you must also believe that the probability of not finding one 
is .80. We explore some of these logical rules later in this chapter. 
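As a rough sketch of the coherence requirement just described, the check below tests whether a personal probability for an event and one for its opposite fit together. The function name `is_coherent_pair` and the tolerance are assumptions made for illustration.

```python
def is_coherent_pair(p_event, p_not_event, tol=1e-9):
    """Return True if both values are valid probabilities (between 0 and 1)
    and together they add to 1, as coherence requires."""
    in_range = 0.0 <= p_event <= 1.0 and 0.0 <= p_not_event <= 1.0
    return in_range and abs(p_event + p_not_event - 1.0) <= tol

# Parking example from the text: .20 for finding a space, .80 for not finding one.
print(is_coherent_pair(0.20, 0.80))
# An incoherent pair: the two beliefs contradict each other.
print(is_coherent_pair(0.20, 0.70))
```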


How We Use Personal Probabilities 


People routinely base decisions on personal probabilities. This is why committee de- 
cisions are often so difficult. For example, suppose a committee is trying to decide 
which candidate to hire for a job. Each member of the committee has a different as- 
sessment of the candidates, and each may disagree with the others on the probabil- 
ity that a particular candidate would fit the job best. We are all familiar with the 
problem juries sometimes have when trying to agree on someone’s guilt or inno- 
cence. Each member of the jury has his or her own personal probability of guilt and 
innocence. One of the benefits of committee or jury deliberations is that such delib- 
erations may help members reach some consensus in their personal probabilities. 

Personal probabilities often take relative frequencies of similar events into account.
For example, the late astronomer Carl Sagan believed that the probability of
a major asteroid hitting the Earth soon is high enough to be of concern. “The prob- 
ability that the Earth will be hit by a civilization-threatening small world in the next 
century is a little less than one in a thousand” (Arraf, 14 December, 1994, p. 4). To 
arrive at that probability, Sagan obviously could not use the long-run frequency def- 
inition of probability. He would have to use his own knowledge of astronomy, com- 
bined with past asteroid behavior. (See Exercise 28 for an updated probability.) 


16.4 Applying Some Simple Probability Rules


Situations often arise where we already know probabilities associated with simple 
events, such as the probability that a birth will result in a girl, and we would like to 
find probabilities of more complicated events, such as the probability that we will 
eventually have at least one girl if we ultimately have four children. Some simple, 
logical rules about probability allow us to do this. 

These rules apply naturally to relative-frequency probabilities, and they must ap- 
ply to personal probabilities if they are to be coherent. For example, we can never 
have a probability below 0 or above 1. An impossible event has a probability of 0 
and a sure thing has a probability of 1. Here are four additional useful rules: 


Rule 1: If there are only two possible outcomes in an uncertain situation, then 
their probabilities must add to 1. 


EXAMPLE 1

If the probability of a single birth resulting in a boy is .51, then the probability of it
resulting in a girl is .49. ■


EXAMPLE 2

If you estimate the chances that you will eventually own a home to be 70%, then in order
to be coherent (consistent with yourself) you are also estimating that there is a
30% chance that you will never own one. ■


EXAMPLE 3

According to Krantz (1992), the probability that a piece of checked luggage will be
temporarily lost on a flight with a U.S. airline is 1/176. Thankfully, that means the
probability of finding the luggage waiting at the end of a trip is 175/176. ■


Rule 2: If two outcomes cannot happen simultaneously, they are said to be 
mutually exclusive. The probability of one or the other of two mutually ex- 
clusive outcomes happening is the sum of their individual probabilities. 


EXAMPLE 4

The two most common primary causes of death in the United States are heart attacks,
which killed about 30% of the Americans who died in the year 2000, and
various cancers, which killed about 23%. Therefore, if this year is like the year 2000,
the probability that a randomly selected American who dies will die of either a heart
attack or cancer is the sum of these two probabilities, or about 0.53 (53%). Notice
that this is based on death rates for the year 2000 and could well change long before
you have to worry about it. This calculation also assumes that one cannot die
simultaneously of both causes; in other words, the two causes of death are mutually
exclusive. Given the way deaths are recorded, this fact is actually guaranteed
because only one primary cause of death may be entered on a death certificate.
(Source: National Center for Health Statistics) ■


EXAMPLE 5

If you estimate your chances of getting an A in your statistics class to be 50% and
your chances of getting a B to be 30%, then you are estimating your chances of
getting either an A or a B to be 80%. Notice that you are therefore estimating your
chances of getting a C or less to be 20% by Rule 1. ■


EXAMPLE 6

If you estimate your chances of getting an A in your statistics class to be 50% and
your chances of getting an A in your history class to be 60%, are you estimating
your chances of getting one or the other, or both, to be 110%? Obviously not, because
probabilities cannot exceed 100%. The problem here is that Rule 2 stated explicitly
that the events under consideration couldn't happen simultaneously.
Because it is possible for you to get an A in both courses simultaneously, Rule 2 does
not apply here. In case you are curious, Rule 2 could be modified to apply.
You would have to subtract the probability that both events happen, which would
require you to estimate that probability as well. We see one way to do that using
Rule 3. ■


Rule 3: If two events do not influence each other, and if knowledge about one 
doesn’t help with knowledge of the probability of the other, the events are said 
to be independent of each other. If two events are independent, the probabil- 
ity that they both happen is found by multiplying their individual probabilities. 


EXAMPLE 7

Suppose a woman has two children. Assume that the outcome of the second
birth is independent of what happened the first time and that the probability that
each birth results in a boy is .51, as observed earlier. Then the probability that
she has a boy followed by a girl is (.51) × (.49) = .2499. In other words, there is
about a 25% chance that a woman having two children will have a boy and then
a girl. ■


EXAMPLE 8

From Example 6, suppose you continue to believe that your probability of getting
an A in statistics is .5 and an A in history is .6. Further, suppose you believe that
the grade you receive in one is independent of the grade you receive in the other.
Then you must also believe that the probability that you will receive an A in both is
(.5) × (.6) = .3. Notice that we can now complete the calculation we started at the
end of Example 6. The probability that you will receive at least one A is found by
taking .5 + .6 − .3 = .8, or 80%. Note that by Rule 1 you must also believe that
the probability of not receiving an A in either class is 20%. ■
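The arithmetic in the grade examples can be written out as a short sketch of Rules 1 to 3, together with the modified form of Rule 2 that subtracts the probability that both events happen. The helper names (`prob_not`, `prob_both_independent`, `prob_either`) are illustrative assumptions, not notation from the text.

```python
def prob_not(p_a):
    """Rule 1: the event either happens or it does not."""
    return 1 - p_a

def prob_both_independent(p_a, p_b):
    """Rule 3: valid only if the two events are independent."""
    return p_a * p_b

def prob_either(p_a, p_b, p_both=0.0):
    """Rule 2 when p_both is 0 (mutually exclusive events);
    otherwise the modified rule that subtracts the overlap."""
    return p_a + p_b - p_both

# Personal probabilities of an A in statistics and an A in history.
p_stat, p_hist = 0.5, 0.6
p_both = prob_both_independent(p_stat, p_hist)      # assumes independence
p_at_least_one = prob_either(p_stat, p_hist, p_both)
p_neither = prob_not(p_at_least_one)
print(round(p_both, 2), round(p_at_least_one, 2), round(p_neither, 2))

# Mutually exclusive causes of death (30% and 23%): plain Rule 2.
print(round(prob_either(0.30, 0.23), 2))
```

Leaving `p_both` at its default of 0 for the two grades would reproduce the impossible 110% figure, which is why the overlap must be subtracted when the events can happen together.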


Rule 3 is sometimes difficult for people to understand, but if you think of it in 
terms of the relative-frequency interpretation of probability, it’s really quite simple. 
Consider women who have had two children. If about half of the women had a boy 
for their first child, and only about half of those women had a girl the second time
around, it makes sense that we are left with only about 25%, or one-fourth, of the
women.
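The relative-frequency reading of Rule 3 can be checked by simulating many two-child families and counting how often a boy is followed by a girl. This is a minimal sketch that assumes independent births with a .51 probability of a boy; the function name and the seed are illustrative choices.

```python
import random

def simulate_boy_then_girl(n_families, p_boy=0.51, seed=1):
    """Return the fraction of simulated two-child families in which
    the first child is a boy and the second is a girl."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_families):
        first_is_boy = rng.random() < p_boy
        second_is_girl = rng.random() >= p_boy  # drawn independently of the first
        if first_is_boy and second_is_girl:
            count += 1
    return count / n_families

estimate = simulate_boy_then_girl(200_000)
exact = 0.51 * 0.49  # Rule 3
print(round(estimate, 3), round(exact, 4))
```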


EXAMPLE 9

Let's try one more example of Rule 3, using the logic just outlined. Suppose you encounter
a red light on 30% of your commutes and get behind a bus on half of your
commutes. The two are unrelated because whether you have the bad luck to get behind
a bus presumably has nothing to do with the red light. The probability of having
a really bad day and having both happen is 15%. This is logical because you get behind
a bus half of the time. Therefore, you get behind a bus half of the 30% of the
time you encounter the red light, resulting in total misery only 15% of the time. Using
Rule 3 directly, this is equivalent to (.30) × (.50) = .15, or 15% of the time both events
happen. ■


One more rule is such common sense that it almost doesn’t warrant writing 
down. However, as we will see in Chapter 17, in certain situations this rule will ac- 
tually seem counterintuitive. Here is the rule: 


Rule 4: If the ways in which one event can occur are a subset of those in which 
another event can occur, then the probability of the subset event cannot be 
higher than the probability of the one for which it is a subset. 


EXAMPLE 10

Suppose you are 18 years old and speculating about your future. You decide that the
probability that you will eventually get married and have children is 75%. By Rule 4, you
must then assume that the probability that you will eventually get married is at least
75%. The possible futures in which you get married and have children are a subset of
the possible futures in which you get married. ■


16.5 When Will It Happen? 


Often, we would like an event to occur and will keep trying to make it happen until 
it does, such as when a couple keeps having children until they have one of the de- 
sired sex. Also, we often gamble that something won’t go wrong, even though we 
know it could, such as when people have unprotected sex and hope they won’t get 
infected with HIV, the virus that causes AIDS.

A simple application of our probability rules allows us to determine the chances 
of waiting one, two, three, or any given number of repetitions for such events to oc- 
cur. Suppose (1) we know the probability of each possible outcome on any given oc- 
casion, (2) those probabilities remain the same for each occasion, and (3) the 
outcome each time is independent of the outcome all of the other times. 

Let’s use some shorthand. Define the probability that the outcome of interest will
occur on any given occasion to be p so that the probability that it will not occur is
(1 − p) by Rule 1. For instance, if we are interested in giving birth to a girl, p is .49
and (1 − p) is .51.

We already know the probability that the outcome occurs on the first try is p. By
Rule 3, the probability that it doesn’t occur on the first try but does occur on the second
try is found by multiplying two probabilities. Namely, it doesn’t happen at first
(1 − p) and then it does happen (p). Thus, the probability that it happens for the first
time on the second try is (1 − p)p. We can continue this logic. We multiply (1 − p)
for each time it doesn’t happen, followed by p for when it finally does happen. We
can represent these probabilities as shown in Table 16.2, and you can see the emerging
pattern.


Table 16.2 Calculating Probabilities

Try on Which the
Outcome First Happens     Probability
1                         p
2                         (1 − p)p
3                         (1 − p)(1 − p)p = (1 − p)²p
4                         (1 − p)(1 − p)(1 − p)p = (1 − p)³p
5                         (1 − p)(1 − p)(1 − p)(1 − p)p = (1 − p)⁴p


EXAMPLE 11  Number of Births to First Girl

The probability of a birth resulting in a boy is about .51, and the probability of a birth
resulting in a girl is about .49. Suppose a couple would like to continue having children
until they have a girl. Assuming the outcomes of births are independent of each other,
the probabilities of having the first girl on the first, second, third, fifth, and seventh tries
are shown in Table 16.3. ■
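The pattern of multiplying (1 − p) for each try on which the outcome does not happen, followed by p, can be sketched as a one-line function. The name `first_occurrence_probability` is an illustrative assumption; the value p = .49 for a girl comes from the example.

```python
def first_occurrence_probability(p, k):
    """Probability that the outcome happens for the first time on try k,
    assuming independent tries that each succeed with probability p."""
    if k < 1:
        raise ValueError("k must be a positive whole number of tries")
    return (1 - p) ** (k - 1) * p

p_girl = 0.49
for k in (1, 2, 3, 5, 7):
    print(k, round(first_occurrence_probability(p_girl, k), 4))
```

Rounded to four decimal places, these are the entries of Table 16.3.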


Table 16.3 Probability of a Birth Resulting in a First Girl

Number of Births
to First Girl       Probability
1                   .49
2                   (.51)(.49) = .2499
3                   (.51)(.51)(.49) = .1274
5                   (.51)(.51)(.51)(.51)(.49) = .0331
7                   (.51)(.51)(.51)(.51)(.51)(.51)(.49) = .0086


Accumulated Probability


We are often more interested in the cumulative probability of something happening
by a certain time than just the specific occasion on which it will occur. For example,
we would probably be more interested in knowing the probability that we would
have had the first girl by the time of the fifth child, rather than the probability that it
would happen at that specific birth. It is easy to use the probability rules to find this
accumulated probability. Notice that the probability of the first occurrence not happening
by occasion n is (1 − p)ⁿ. Therefore, the probability that the first occurrence
has happened by occasion n is [1 − (1 − p)ⁿ] from Rule 1. For instance, the probability
that a girl will not have been born by the third birth is (1 − p)³ = (.51)³ =
.1327. Thus, the probability that a girl will have been born by the third birth is 1 −
.1327 = .8673. This is equivalent to adding the probabilities that the first girl occurs
on the first, second, or third tries: .49 + .2499 + .1274 = .8673.
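The two routes to the accumulated probability, the shortcut from Rule 1 and the term-by-term sum from Rule 2, can be compared directly. This is a sketch with illustrative function names; it also evaluates the question raised at the start of Section 16.4 about having at least one girl among four children.

```python
def probability_by_try(p, n):
    """Probability the outcome has happened at least once by try n (Rule 1)."""
    return 1 - (1 - p) ** n

def probability_by_try_summed(p, n):
    """Same quantity, found by adding the first-occurrence probabilities (Rule 2)."""
    return sum((1 - p) ** (k - 1) * p for k in range(1, n + 1))

p_girl = 0.49
print(round(probability_by_try(p_girl, 3), 4))
print(round(probability_by_try_summed(p_girl, 3), 4))
# At least one girl among four children, under the same assumptions.
print(round(probability_by_try(p_girl, 4), 4))
```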


EXAMPLE 12  Getting Infected with HIV


According to Krantz (1992, p. 13), the probability of getting infected with HIV from a 
single heterosexual encounter without a condom, with a partner whose risk status you 
don't know, is between 1/500 and 1/500,000. For the sake of this example, let's as- 
sume it is the higher figure of 1/500 = .002. Therefore, on such a single encounter, the 
probability of not getting infected is 499/500 = .998. However, the risk of getting in- 
fected goes up with multiple encounters, and, using the strategy we have just outlined, 
we can calculate the probabilities associated with the number of encounters it would 
take to become infected. Of course, the real interest is in whether infection will have oc- 
curred after a certain number of encounters, and not just on the exact encounter dur- 
ing which it occurs. In Table 16.4, we show this accumulated probability as well. It is 
found by adding the probabilities to that point, using Rule 2. Equivalently, we could use 
the general form we found just prior to this example. For instance, the probability of HIV 
infection by the second encounter is [1 − (1 − .002)²] or [1 − .998²] or .003996.

Table 16.4 tells us that although the risk after a single encounter is only 1 in 500, af- 
ter ten encounters the accumulated risk has risen to almost .02, or almost 1 in 50. This 
means that out of all those people who have 10 encounters, about 1 in 50 of them is 
likely to get infected with HIV. Also, according to Krantz (1992, p. 13), the probability of 
infection with a partner whose status is unknown if a condom is used is between 
1/5000 and 1 in 5 million. Assuming the higher figure of 1/5000, the probability of infection
after 10 encounters is only about .002, or roughly 1 in 500.

The base rate, or probability of infection, may have changed by the time you read 
this. But the method for calculating the risk remains the same, and you can reevaluate 
all of the numbers if you know the current risk for a single encounter. ■
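Because the method stays the same when the base rate changes, it may help to have the calculation in reusable form. The sketch below assumes independent encounters with a constant per-encounter risk, as the example does; the function name `accumulated_risk_table` is hypothetical.

```python
def accumulated_risk_table(p_single, encounter_counts):
    """For each number of encounters n, return a tuple of
    (n, probability first infection occurs on encounter n,
     probability infection has occurred by encounter n)."""
    rows = []
    for n in encounter_counts:
        first_on_n = (1 - p_single) ** (n - 1) * p_single
        by_n = 1 - (1 - p_single) ** n
        rows.append((n, first_on_n, by_n))
    return rows

# The example's assumed risk of 1/500 per encounter without a condom.
rows = accumulated_risk_table(1 / 500, (1, 2, 4, 10))
for n, first, cumulative in rows:
    print(f"{n:>2}  {first:.6f}  {cumulative:.6f}")

# Ten encounters at the higher with-condom figure of 1/5000.
with_condom = accumulated_risk_table(1 / 5000, (10,))[0][2]
print(f"{with_condom:.6f}")
```

Substituting a current per-encounter risk for `1 / 500` re-evaluates every entry at once.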


Table 16.4 The Probability of Getting Infected with HIV
from Unprotected Sex

Number of      Probability of              Accumulated
Encounters     First Infection             Probability of HIV
1              .002                        .002
2              (.998)(.002) = .001996      .003996
4              (.998)³(.002) = .001988     .007976
10             (.998)⁹(.002) = .001964     .019821


Table 16.5 Probabilities of Winning Pick Six

Number      Probability                    Accumulated
of Plays    of First Win                   Probability of Win
1           1/54 = .0185                   .0185
2           (53/54)(1/54) = .0182          .0367
5           (53/54)⁴(1/54) = .0172         .0892
10          (53/54)⁹(1/54) = .0157         .1705
20          (53/54)¹⁹(1/54) = .0130        .3119


EXAMPLE 13  Winning the Lottery


To play the New Jersey Pick Six game, a player picks six numbers from the choices 1 to 
49. Six winning numbers are selected. If the player has matched at least three of the 
winning numbers, the ticket is a winner. Matching three numbers results in a prize of 
$3.00; matching four or more results in a prize determined by the number of other suc- 
cessful entries. The probability of winning anything at all is 1/54. How many times 
would you have to play before winning anything? See Table 16.5. 

If you do win, your most likely prize is only $3.00. Notice that even after purchasing 
five tickets, which cost one dollar each, your probability of having won anything is still 
under 10%; in fact, it is about 9%. After 20 tries, your probability of having won anything
is just over 31%. In the next section, we learn how to determine the average expected
payoff from playing games like this. ■
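One way to approach the question of how many plays it takes is to ask for the smallest number of plays at which the accumulated probability of a win reaches a chosen target. The sketch below solves 1 − (1 − p)ⁿ ≥ target for n; the function name `plays_needed` and the 50% target are assumptions added for illustration, not figures from the text.

```python
import math

def plays_needed(p_win, target):
    """Smallest number of independent plays n for which
    1 - (1 - p_win)**n is at least the target probability."""
    if not (0 < p_win < 1 and 0 < target < 1):
        raise ValueError("p_win and target must be strictly between 0 and 1")
    return math.ceil(math.log(1 - target) / math.log(1 - p_win))

p_win = 1 / 54
# Accumulated probabilities for the numbers of plays listed in Table 16.5.
for n in (1, 2, 5, 10, 20):
    print(n, round(1 - (1 - p_win) ** n, 4))

# Plays needed before a win is more likely than not (illustrative target).
n_half = plays_needed(p_win, 0.5)
print(n_half)
```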


16.6 Long-Term Gains, Losses, and Expectations 



The concept of the long-run relative frequency of various outcomes can be used to 
predict long-term gains and losses. Although it is impossible to predict the result of 
one random happening, we can be remarkably successful in predicting aggregate or 
long-term results. For example, we noted that the probability of winning anything at 
all in the New Jersey Pick Six game with a single ticket is 1 in 54. Among the millions
of people who play that game regularly, some will be winners and some will
be losers. We cannot predict who will win and who will lose. However, we can predict
that in the long run, about 1 in every 54 tickets sold will be a winner.


Long-Term Outcomes Can Be Predicted 


It is because aggregate or long-term outcomes can be accurately predicted that lot- 
tery agencies, casinos, and insurance companies are able to stay in business. Because 
they can closely predict the amount they will have to pay out over the long run, they 
can determine how much to charge and still make a profit. 


EXAMPLE 14  Insurance Policies


Suppose an insurance company has thousands of customers, and each customer is 
charged $500 a year. The company knows that about 10% of them will submit a claim 
in any given year and that claims will always be for $1500. How much can the company
expect to make per customer? 

Notice that there are two possibilities. With probability .90 (or for about 90% of the 
customers), the amount gained by the company is $500, the cost of the policy. With 
probability .10 (or for about 10% of the customers), the “amount gained” by the company
is the $500 cost of the policy minus the $1500 payoff, for a loss of $1000. We represent
the loss by saying that the “amount gained” is −$1000, or negative one
thousand dollars. Here are the possible amounts “gained” and their probabilities:


Claim Paid?    Probability    Amount Gained
Yes            .10            −$1000
No             .90            +$500


What is the average amount gained, per customer, by the company? Because the 
company gains $500 from 90% of its customers and loses $1000 from the remaining 
10%, its average “gain” per customer is: 


average gain = .90 ($500) − .10 ($1000) = $350


In other words, the company makes an average of $350 per customer. Of course, to 
succeed this way, it must have a large volume of business. If it had only a few customers, 
the company could easily lose money in any given year. As we have seen, long-run frequencies
apply only to a large aggregate. For example, if the company had only two customers,
we could use Rule 3 to find that the probability of the company’s having to pay
both of them during a given year is .1 × .1 = .01 = 1/100. This calculation assumes
that the probability of paying for one individual is independent of that for the other individual,
which is a reasonable assumption unless the customers are somehow related
to each other. ■
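The insurance calculation can be sketched in a few lines. The helper `expected_value` and the variable names are illustrative assumptions; the dollar amounts and the 10% claim rate come from the example, and the two-customer figures assume the customers' claims are independent.

```python
def expected_value(amounts, probabilities):
    """Multiply each possible amount by its probability and add the results."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must add to 1")
    return sum(a * p for a, p in zip(amounts, probabilities))

premium = 500        # yearly charge per customer
claim_payout = 1500  # every claim is for this amount
p_claim = 0.10

gain_if_claim = premium - claim_payout  # -1000
gain_if_no_claim = premium              # +500
average_gain = expected_value(
    [gain_if_claim, gain_if_no_claim], [p_claim, 1 - p_claim]
)

# A company with only two independent customers.
p_both_claim = p_claim * p_claim               # Rule 3
p_at_least_one_claim = 1 - (1 - p_claim) ** 2  # Rule 1 applied to "no claims"

print(round(average_gain, 2), round(p_both_claim, 4), round(p_at_least_one_claim, 4))
```

With two customers, a single claim already means a net loss for the year (2 × $500 − $1500 = −$500), so under these assumptions the chance of a losing year equals the probability of at least one claim.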


Expected Value 


Statisticians use the phrase expected value (EV) to represent the average value of 
any measurement over the long run. The average gain of $350 per customer for our 
hypothetical insurance company is called the expected value for the amount the com- 
pany earns per customer. Notice that the expected value does not have to be one of 
the possible values. For our insurance company, the two possible values were $500
and −$1000. Thus, the expected value of $350 was not even a possible value for any
one customer. In that sense, “expected value” is a real misnomer. It doesn’t have to 
be a value that’s ever expected in a single outcome. 

To compute the expected value for any situation, we need only be able to specify
the possible amounts (call them A₁, A₂, A₃, . . . , Aₖ) and the associated probabilities,
which can be denoted by p₁, p₂, p₃, . . . , pₖ. Then the expected value can be
found by multiplying each possible amount by its probability and adding them up.
Remember, the expected value is the average value per measurement over the long 
run and not necessarily a typical value for any one occasion or person. 



CHAPTER 16 Understanding Probability and Long-Term Expectations 309 


Computing the Expected Value 
EV = expected value = A₁p₁ + A₂p₂ + A₃p₃ + · · · + Aₖpₖ


EXAMPLE 15  California Decco Lottery Game


The California lottery has offered a number of games over the years. One such game is 
Decco, in which players choose one card from each of the four suits in a regular deck of 
playing cards. For example, the player might choose the 4 of hearts, 3 of clubs, 10 of di- 
amonds, and jack of spades. A winning card is then drawn from each suit. If even one 
of the choices matches the winning cards drawn, a prize is awarded. It costs one dollar 
for each play, so the net gain for any prize is one dollar less than the prize. Table 16.6 
shows the prizes and the probability of winning each prize, taken from the back of a 
game card. 
We can thus compute the expected value for this Decco game: 


EV = ($4999 × 1/28,561) + ($49 × 1/595) + ($4 × .0303) + (−$1 × .726)
= −$0.35


Notice that we count the free ticket as an even trade because it is worth $1, the 
same amount it cost to play the game. This result tells us that over many repetitions of 
the game, you will lose an average of 35 cents each time you play. From the perspective 
of the Lottery Commission, about 65 cents is paid out for each one dollar ticket sold for 
this game. (Astute readers will realize that this is an underestimate for the Lottery Com- 
mission and an overestimate for the player. The true cost of giving the free ticket as a 
prize is the expected payout per ticket, not the $1.00 purchase price of the ticket.) ■
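The Decco calculation can be reproduced from the net gains and probabilities in Table 16.6. This is a sketch that follows the example's simplification of treating the free ticket as a net gain of 0; the list name `decco_outcomes` is an illustrative choice.

```python
# (net gain in dollars, probability) for 4, 3, 2, 1, and 0 matches.
decco_outcomes = [
    (4999, 1 / 28561),
    (49, 1 / 595),
    (4, 1 / 33),
    (0, 0.2420),   # free ticket, counted as an even trade
    (-1, 0.7260),
]

# The published probabilities are rounded, so they add to 1 only approximately.
total_probability = sum(p for _, p in decco_outcomes)
expected_net_gain = sum(gain * p for gain, p in decco_outcomes)

print(round(total_probability, 4), round(expected_net_gain, 2))
```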


Table 16.6 Probability of Winning the Decco Game

Number
of Matches    Prize          Net Gain    Probability
4             $5000          $4999       1/28,561 = .000035
3             $50            $49         1/595 = .00168
2             $5             $4          1/33 = .0303
1             Free ticket    0           .2420
0             None           −$1         .7260


Expected Value as Mean Number


If the measurement in question is one taken over a large group of individuals, rather
than across time, the expected value can be interpreted as the mean value per individual.
As a simple example, suppose we had a population in which 40% of the people
smoked a pack of cigarettes a day (20 cigarettes) and the remaining 60% smoked
none. Then the expected value for the number of cigarettes smoked per day by one
person would be


EV = (.40 × 20 cigarettes) + (.60 × 0 cigarettes) = 8 cigarettes


In other words, on average, eight cigarettes are smoked per person per day. If we 
were to measure each person in the population by asking them how many cigarettes 
they smoked per day (and they answered truthfully), then the arithmetic average 
would be 8. This example further illustrates the fact that the expected value is not a 
value we actually expect to measure on any one individual. 


Birthdays and Death Days—Is There a Connection? 
Source: Phillips, Van Voorhies, and Ruth (1992). 


Is the timing of death random or does it depend on significant events in one’s life? 
That’s the question University of California at San Diego sociologist David Phillips 
and his colleagues attempted to answer. Previous research had shown a possible con- 
nection between the timing of death and holidays and other special occasions. This 
study focused on the connection between birthday and death day. 

The researchers studied death certificates of all Californians who had died be- 
tween 1969 and 1990. Because of incomplete information before 1978, we report 
only on the part of their study that included the years 1979 to 1990. They limited 
their study to adults (over 18) who had died of natural causes. They eliminated any- 
one for whom surgery had been a contributing factor to death because there is some 
choice as to when to schedule surgery. They also omitted those born on February 29 
because there was no way to know on which date these people celebrated their birth- 
day in non-leap years. 

Because there is a seasonal component to birthdays and death days, the re- 
searchers adjusted the numbers to account for those as well. They determined the 
number of deaths that would be expected on each day of the year if date of birth and 
date of death were independent of each other. Each death was then classified as to 
how many weeks after the birthday it occurred. For example, someone who died 
from 0 to 6 days after his or her birthday was classified as dying in “Week 0,”
whereas someone who died from 7 to 13 days after the birthday was classified in 
“Week 1,” and so on. Thus, people who died in Week 51 died within a few days be- 
fore their birthdays. 

Finally, the researchers compared the actual numbers of deaths during each week 
with what would be expected based on the seasonally adjusted data. Here is what 
they found. For women, the biggest peak was in Week 0. For men, the biggest peak 
was in Week 51. In other words, the week during which the highest number of 
women died was the week after their birthdays. The week during which the highest 
number of men died was the week before their birthdays. 

Perhaps this observation is due only to chance. Each of the 52 weeks is equally 
likely to show the biggest peak. What is the probability that the biggest peak for the 
women would be Week 0 and the biggest peak for the men would be Week 51? Using
Rule 3, the probability of both events occurring is (1/52) × (1/52) = 1/2704 =
.0004.

As we will learn in Chapter 18, unusual events often do happen just by chance.
Many facts given in the original report, however, add credence to the idea that this 
is not a chance result. For example, the peak for women in Week 0 remained even 
when the deaths were separated by age group, by race, and by cause of death. It was 
also present in the sample of deaths from 1969 to 1977. Further, earlier studies from 
various cultures have shown that people tend to die just after holidays important to 
that culture. 


For Those Who Like Formulas 


Notation 


Denote “events” or “outcomes” with capital letters A, B, C, and so on.
If A is one outcome, all other possible outcomes are part of “A complement” = A^C.

P(A) is the probability that the event or outcome A occurs. For any event A,
0 ≤ P(A) ≤ 1.


Rule 1

P(A) + P(A^C) = 1

A useful formula that results from this is

P(A^C) = 1 − P(A)


Rule 2

If events A and B are mutually exclusive, then

P(A or B) = P(A) + P(B)


Rule 3

If events A and B are independent, then

P(A and B) = P(A) P(B)


Rule 4

If the ways in which an event B can occur are a subset of those for event A, then

P(B) ≤ P(A)


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. Recall that there are two interpretations of probability: relative frequency and
personal probability.

a. Which interpretation applies to this statement: “The probability that I will
get the flu this winter is 30%”? Explain.

b. Which interpretation applies to this statement: “The probability that a randomly
selected adult in America will get the flu this winter is 30%”? Explain.
(Assume it is known that the proportion of adults who get the flu each
winter remains at about 30%.)


*2. Use the probability rules in this chapter to solve each of the following:

a. The probability that a randomly selected Caucasian American child will have
blonde or red hair is 23%. The probability of having blonde hair is 14%.
What is the probability of having red hair?

b. According to Blackenhorn (24–26 February 1995), in 1990 the probability
that a randomly selected child was living with his or her mother as the sole
parent was .216 and with his or her father as the sole parent was .031. What
was the probability that a child was living with just one parent?

c. In 2001, the probability that a birth would result in twins was .0301, and the
probability that a birth would result in triplets or more was .0019 (Source:
National Center for Health Statistics). What was the probability that a birth
in 2001 resulted in a single child only?


3. There is something wrong in each of the following statements. Explain what is
wrong.

a. The probability a randomly selected driver will be wearing a seat belt is .75,
whereas the probability that he or she will not be wearing one is .30.

b. The probability that a randomly selected car is red is 1.20.

c. The probability that a randomly selected car is red is .20, whereas the probability
that a randomly selected car is a red sports car is .25.


4. According to Krantz (1992, p. 111), the probability of being born on a Friday
the 13th is about 1/214.

a. What is the probability of not being born on a Friday the 13th?

b. In any particular year, Friday the 13th can occur once, twice, or three times.
Is the probability of being born on Friday the 13th the same every year?
Explain.


5. Explain what it means to say that the probability of being born on Friday the
13th is 1/214.


*6. Explain which of the following more closely describes what it means to say that
the probability of a tossed coin landing with heads up is 1/2:

Explanation 1: After more and more tosses, the fraction of heads will get
closer and closer to 1/2.

Explanation 2: The number of heads will always be about half the number
of tosses.


7. Explain why probabilities cannot always be interpreted using the relative-frequency
interpretation. Give an example of where that interpretation would
not apply.

8. Suppose you wanted to test your ESP using an ordinary deck of 52 cards, which
has 26 red and 26 black cards. You have a friend shuffle the deck and draw cards 
at random, replacing the card and reshuffling after each guess. You attempt to 
guess the color of each card. 


a. What is the probability that you guess the color correctly by chance? 


b. Is the answer in part a based on the relative-frequency interpretation of prob- 
ability or is it a personal probability? 

c. Suppose another friend has never tried the experiment but believes he has 
ESP and can guess correctly with probability .60. Is the value of .60 a 
relative-frequency probability or a personal probability? Explain. 


d. Suppose another friend guessed the color of 1000 cards and got 600 correct. 
The friend claims she has ESP and has a .60 probability of guessing cor- 
rectly. Is the value of .60 a relative-frequency probability or a personal prob- 
ability? Explain. 


8. Suppose you wanted to determine the probability that someone randomly selected
from the phone book in your town or city has the same first name as you.


a. Assuming you had the time and energy to do it, how would you go about de- 
termining that probability? (Assume all names listed are spelled out.) 


b. Using the method you described in part a, would your result be a relative- 
frequency probability or a personal probability? Explain. 


*9. A small business performs a service and then bills its customers. From past
experience, 90% of the customers pay their bills within a week.


*a. What is the probability that a randomly selected customer will not pay within
a week?


*b. The business has billed two customers this week. What is the probability that 
neither of them will pay within a week? What assumption did you make to 
compute that probability? Is it a reasonable assumption? 


10. Suppose the probability that you get an interesting piece of mail on any given
weekday is 1/10. Is the probability that you get at least one interesting piece of 
mail during the week (Monday to Friday) equal to 5/10? Why or why not? 


11. The probability that a randomly selected American adult belongs to the American
Automobile Association (AAA) is .10 (10%), and the probability that that
person belongs to the American Association of Retired Persons (AARP) is .11
(11%) (Krantz, 1992, p. 175). What assumption would we have to make in order
to use Rule 3 to conclude that the probability that a person belongs to both
is (.10) × (.11) = .011? Do you think that assumption holds in this case?
Explain.

12. A study by Kahneman and Tversky (1982, p. 496) asked people the following
question: “Linda is 31 years old, single, outspoken, and very bright. She majored
in philosophy. As a student, she was deeply concerned with issues of
discrimination and social justice, and also participated in antinuclear
demonstrations. Please check off the most likely alternative:

A. Linda is a bank teller.
B. Linda is a bank teller and is active in the feminist movement.”


Nearly 90% of the 86 respondents chose alternative B. Explain why alternative
B cannot have a higher probability than alternative A.


13. Example 3 in this chapter states that “the probability that a piece of checked
luggage will be temporarily lost on a flight with a U.S. airline is 1/176.” Interpret
that statement, using the appropriate interpretation of probability.


*14. In Section 16.2, you learned two ways in which relative-frequency probabilities
can be determined. Explain which method you think was used to determine each 
of the following probabilities: 


*a. The probability that a particular flight from New York to San Francisco will 
be on time is .78. 


b. On any given day, the probability that a randomly selected American adult 
will read a book for pleasure is .33. 


c. The probability that a five-card poker hand contains “four of a kind” is 
.00024. 


15. People are surprised to find that it is not all that uncommon for two people in a
group of 20 to 30 people to have the same birthday. We will learn how to find 
that probability in a later chapter. For now, consider the probability of finding 
two people who have birthdays in the same month. Make the simplifying as- 
sumption that the probability that a randomly selected person will have a birth- 
day in any given month is 1/12. Suppose there are three people in a room and 
you consecutively ask them their birthdays. Your goal, following parts a-d (be- 
low), is to determine the probability that at least two of them were born in the 
same calendar month. 


a. What is the probability that the second person you ask will not have the same 
birth month as the first person? (Hint: Use Rule 1.) 


b. Assuming the first and second persons have different birth months, what is 
the probability that the third person will have yet a different birth month? 
(Hint: Suppose January and February have been taken. What proportion of 
all people will have birth months from March to December?) 


c. Explain what it would mean about overlap among the three birth months if 
the outcomes in part a and part b both happened. What is the probability that 
the outcomes in part a and part b will both happen? 


d. Explain what it would mean about overlap among the three birth months if 
the outcomes in part a and part b did not both happen. What is the probabil- 
ity of that occurring? 


16. Use your own particular expertise to assign a personal probability to something,
such as the probability that a certain sports team will win next week. Now assign 
a personal probability to another related event. Explain how you determined 
each probability, and explain how your assignments are coherent. 


*17. Read the definition of “independent events” given in Rule 3. Explain whether
each of the following pairs of events is likely to be independent:


*a. A married couple goes to the voting booth. Event A is that the husband votes
for the Republican candidate; event B is that the wife votes for the Republican
candidate.


b. Event A is that it snows tomorrow; event B is that the high temperature to- 
morrow is at least 60 degrees Fahrenheit. 


c. You buy a lottery ticket, betting the same numbers two weeks in a row. Event 
A is that you win in the first week; event B is that you win in the second 
week. 


d. Event A is that a major earthquake will occur somewhere in the world in the 
next month; event B is that the Dow Jones Industrial Average will be higher 
in one month than it is now. 


18. Suppose you routinely check coin-return slots in vending machines to see if they
have any money in them. You have found that about 10% of the time you find 
money. 


a. What is the probability that you do not find money the next time you check? 

b. What is the probability that the next time you will find money is on the third 
try? 

c. What is the probability that you will have found money by the third try? 


19. Lyme disease is a disease carried by ticks, which can be transmitted to humans
by tick bites. Suppose the probability of contracting the disease is 1/100 for 
each tick bite. 


a. What is the probability that you will not get the disease when bitten once? 


b. What is the probability that you will not get the disease from your first tick 
bite and will get it from your second tick bite? 


20. According to Krantz (1992, p. 161), the probability of being injured by lightning
in any given year is 1/685,000. Assume that the probability remains the same 
from year to year and that avoiding a strike in one year doesn’t change your 
probability in the next year. 


a. What is the probability that someone who lives 70 years will never be struck
by lightning? You do not need to compute the answer, but write down how it
would be computed.

b. According to Krantz, the probability of being injured by lightning over the 
average lifetime is 1/9100. Show how that probability should relate to your 
answer in part a, assuming that average lifetime is about 70 years. 

c. Do the probabilities given in this exercise apply specifically to you? Explain. 

d. About 290 million people live in the United States. In a typical year, assum- 
ing Krantz’s figure is accurate, about how many would be expected to be 
struck by lightning? 

21. Suppose you have to cross a train track on your commute. The probability that
you will have to wait for a train is 1/5, or .20. If you don’t have to wait, the
commute takes 15 minutes, but if you have to wait, it takes 20 minutes.

a. What is the expected value of the time it takes you to commute?
b. Is the expected value ever the actual commute time? Explain.


22. Remember that the probability that a birth results in a boy is about .51. You
offer a bet to an unsuspecting friend. Each day you will call the local hospital and
find out how many boys and how many girls were born the previous day. For
each girl, you will give your friend $1 and for each boy your friend will give
you $1.

a. Suppose that on a given day there are 3 births. What is the probability that 
you lose $3 on that day? What is the probability that your friend loses $3? 

b. Notice that your net profit is $1 if a boy is born and -$1 if a girl is born. What 
is the expected value of your profit for each birth? 

c. Using your answer in part b, how much can you expect to make after 1000 
births? 

23. In the “3 Spot” version of the former California Keno lottery game, the player
picked three numbers from 1 to 40. Ten possible winning numbers were then
randomly selected. It cost $1 to play. The table shows the possible outcomes.
Compute the expected value for this game. Interpret what it means.


Number of Matches    Amount Won    Probability

3                    $20           .012
2                    $2            .137
0 or 1               $0            .851


*24. Suppose the probability that you get an A in any class you take is .3 and the
probability that you get a B is .7. To construct a grade-point average, an A is 
worth 4.0 and a B is worth 3.0. What is the expected value for your grade-point 
average? Would you expect to have this grade-point average separately for each 
quarter or semester? Explain. 


25. In 1991, 72% of children in the United States were living with both parents,
22% were living with mother only, 3% were living with father only, and 3% 
were not living with either parent (World Almanac and Book of Facts, 1993, 
p. 945). What is the expected value for the number of parents a randomly se- 
lected child was living with? Does the concept of expected value have a mean- 
ingful interpretation for this example? Explain. 


*26. We have seen many examples for which the term expected value seems to be a
misnomer. Construct an example of a situation where the term expected value 
would not seem to be a misnomer for what it represents. 


27. Find out your yearly car insurance cost. If you don’t have a car, find out the
yearly cost for a friend or relative. Now assume you will either have an accident
or not, and if you do, it will cost the insurance company $5000 more than the
premium you pay. Calculate what yearly accident probability would result in a
“break-even” expected value for you and the insurance company. Comment on
whether you think your answer is an accurate representation of your yearly
probability of having an accident.


28. On November 9, 2001, the Sacramento Bee reported, “Using new data, scientists
have dramatically lowered the odds [that an asteroid will wipe out the Earth].
They now say there’s just a 1-in-5,000 chance that an asteroid bigger than
half-a-mile wide will hit the Earth in the next 100 years” (p. A31). Is this
probability based on relative frequency or is it a personal probability? Explain.


Mini-Projects


1. Refer to Exercise 12. Present the question to 10 people, and note the proportion
who answer with alternative B. Explain to the participants why it cannot be the
right answer, and report on their reactions.


2. Flip a coin 100 times. Stop each time you have done 10 flips (that is, stop after
10 flips, 20 flips, 30 flips, and so on), and compute the proportion of heads using
all of the flips up to that point. Plot that proportion versus the number of
flips. Comment on how the plot relates to the relative-frequency interpretation
of probability.


3. Pick an event that will result in the same outcome for everyone, such as whether
it will rain next Saturday. Ask 10 people to assess the probability of that event,
and note the variability in their responses. (Don’t let them hear each other’s
answers, and make sure you don’t pick something that would have 0 or 1 as a
common response.) At the same time, ask them the probability of getting a heart
when a card is randomly chosen from a fair deck of cards. Compare the variability
in responses for the two questions, and explain why one is more variable
than the other.


4. Find two lottery or casino games that have fixed payoffs and for which the
probabilities of each payoff are available. (Some lottery tickets list them on the back
of the ticket or on the lottery’s Web site. Some books about gambling give the
payoffs and probabilities for various casino games.)

a. Compute the expected value for each game. Discuss what they mean. 

b. Using both the expected values and the list of payoffs and probabilities, explain
which game you would rather play and why.


Psychological Influences
on Personal Probability


Thought Questions 


1. During the Cold War, Plous (1993) presented readers with the following test. 


Place a check mark beside the alternative that seems most likely to occur within the 
next 10 years: 


■ An all-out nuclear war between the United States and Russia


■ An all-out nuclear war between the United States and Russia in which neither
country intends to use nuclear weapons, but both sides are drawn into the conflict
by the actions of a country such as Iraq, Libya, Israel, or Pakistan.


Using your intuition, pick the more likely event at that time. Now consider the prob- 
ability rules discussed in Chapter 16 to try to determine which statement was more 
likely. 

2. Which is a more likely cause of death in the United States, homicide or diabetes? 
How did you arrive at your answer? 

3. Do you think people are more likely to pay to reduce their risk of an undesirable event 
from 95% to 90% or to reduce it from 5% to zero? Explain whether there should be 
a preferred choice, based on the material from Chapter 16. 


4. A fraternity consists of 30% freshmen and sophomores and 70% juniors and seniors. 
Bill is a member of the fraternity, he studies hard, he is well liked by his fellow fra- 
ternity members, and he will probably be quite successful when he graduates. Is 
there any way to tell if Bill is more likely to be a lower classman (freshman or
sophomore) or an upper classman (junior or senior)?


17.1 Revisiting Personal Probability


In Chapter 16, we assumed that the probabilities of various outcomes were known 
or could be calculated using the relative-frequency interpretation of probability. But 
most decisions people make are in situations that require the use of personal proba- 
bilities. The situations are not repeatable, nor are there physical assumptions that can 
be used to calculate potential relative frequencies. 

Personal probabilities, remember, are values assigned by individuals based on 
how likely they think events are to occur. By their very definition, personal proba- 
bilities do not have a single correct value. However, they should still follow the rules 
of probability, which we outlined in Chapter 16; otherwise, decisions based on them 
can be contradictory. For example, if you believe there is a very high probability that 
you will be killed in an automobile accident before you reach a certain age, but also 
believe there is a very high probability that you will live to be 100, then you will be 
conflicted over how well to protect your health. Your two personal probabilities are 
not consistent with each other and will lead to contradictory decisions. 

In this chapter, we explore research that has shown how personal probabilities 
can be influenced by psychological factors in ways that lead to incoherent or incon- 
sistent probability assignments. We also examine circumstances in which many peo- 
ple assign personal probabilities that can be shown to be incorrect based on the 
relative-frequency interpretation. Every day, you are required to make decisions that 
involve risks and rewards. Understanding the kinds of influences that can affect your 
decisions adversely should help you make more realistic judgments. 


17.2 Equivalent Probabilities; Different Decisions 


People like a sure thing. It would be wonderful if we could be guaranteed that can- 
cer would never strike us, for instance, or that we would never be in an automobile 
accident. For this reason, people are willing to pay a premium to reduce their risk of 
something to zero, but are not as willing to pay to reduce their risk by the same 
amount to a nonzero value. We consider two versions of this psychological reality. 


The Certainty Effect 


Suppose you are buying a new car. The salesperson explains that you can purchase 
an optional safety feature for $200 that will reduce your chances of death in a high- 
speed accident from 50% to 45%. Would you be willing to purchase the device? 

Now suppose instead that the salesperson explains that you can purchase an op- 
tional safety feature for $200 that will reduce your chances of death in a high-speed 
accident from 5% to zero. Would you be willing to purchase the device? 

In both cases, your chances of death are reduced by 5 percentage points, or 1/20. But
research has shown that people are more willing to pay to reduce their risk from a fixed
amount down to zero than they are to reduce their risk by the same amount when it
is not reduced to zero. This is called the certainty effect (Plous, 1993, p. 99).
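
The two offers from the car salesperson can be compared side by side. The short Python sketch below is only an illustration (the function name `risk_reduction` is our own, not something from the studies cited). It shows that both offers remove the same amount of risk in absolute terms, even though the share of the original risk that is removed is very different.

```python
def risk_reduction(before, after):
    """Return (absolute, relative) reduction in risk for one offer."""
    absolute = before - after      # amount of risk removed, as a proportion
    relative = absolute / before   # share of the original risk that is removed
    return absolute, relative

offer_1 = risk_reduction(0.50, 0.45)   # risk falls from 50% to 45%
offer_2 = risk_reduction(0.05, 0.00)   # risk falls from 5% to zero

for label, (absolute, relative) in [("50% to 45%", offer_1), ("5% to zero", offer_2)]:
    print(f"{label}: absolute reduction = {absolute:.2f}, relative reduction = {relative:.0%}")
```

The absolute reduction is the same for both offers, which is why the probability rules give no reason to prefer one over the other at the same price.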


EXAMPLE 1 Probabilistic Insurance


To test whether the certainty effect influences decisions, Kahneman and Tversky (1979) 
asked students if they would be likely to buy “probabilistic insurance.” This insurance 
would cost half as much as regular insurance but would only cover losses with 50% 
probability. The majority of respondents (80%) indicated that they would not be inter- 
ested in such insurance. Notice that the expected value for the return on this insurance 
is the same as on the regular policy. It is the lack of assurance of a payoff that makes it 
unattractive.
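
To see why the expected return is the same, it helps to work through one case. The sketch below uses made-up numbers (a $1000 premium, a $20,000 loss, and a .04 chance of loss are assumptions for illustration only, not figures from Kahneman and Tversky) and computes the expected payout per dollar of premium for each type of policy.

```python
def expected_payout_per_dollar(premium, loss, p_loss, p_covered):
    """Expected insurance payout divided by the premium paid."""
    expected_payout = p_loss * p_covered * loss
    return expected_payout / premium

# Hypothetical figures, chosen only for illustration.
LOSS = 20_000
P_LOSS = 0.04

regular = expected_payout_per_dollar(premium=1000, loss=LOSS, p_loss=P_LOSS, p_covered=1.0)
probabilistic = expected_payout_per_dollar(premium=500, loss=LOSS, p_loss=P_LOSS, p_covered=0.5)

print(f"regular policy:       {regular:.2f} expected payout per premium dollar")
print(f"probabilistic policy: {probabilistic:.2f} expected payout per premium dollar")
```

Halving the premium and halving the chance of being covered cancel each other, so the two ratios agree whatever values are chosen for the loss and its probability.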


The Pseudocertainty Effect 


A related idea, used in marketing, is that of the pseudocertainty effect (Slovic, 
Fischhoff, and Lichtenstein, 1982, p. 480). Rather than being offered a reduced risk 
on a variety of problems, you are offered a complete reduction of risk on certain 
problems and no reduction on others. 

As an example, consider the extended warranty plans offered on automobiles and 
appliances. You pay a price when you purchase the item and certain problems are 
covered completely for a number of years. Other problems are not covered. If you 
were offered a plan that covered all problems with 30% probability, you would prob- 
ably not purchase it. But if you were offered a plan that completely covered 30% of 
the possible problems, you might consider it. Both plans have the same expected 
value over the long run, but most people prefer the plan that covers some problems 
with certainty. 
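
A small simulation can make the long-run claim concrete. The sketch below rests on simplifying assumptions that are ours, not the text’s: there are 10 types of problems, each equally likely and equally costly, and the second plan fully covers 3 of the 10 types. It counts the share of problems each plan ends up covering.

```python
import random

def simulate(n_problems=100_000, seed=0):
    """Share of problems covered under each hypothetical warranty plan."""
    rng = random.Random(seed)
    covered_types = {0, 1, 2}       # plan B: 3 of the 10 problem types are fully covered
    plan_a = plan_b = 0
    for _ in range(n_problems):
        problem_type = rng.randrange(10)   # assumed: all 10 types equally likely
        if rng.random() < 0.30:            # plan A: any problem is covered with probability .30
            plan_a += 1
        if problem_type in covered_types:  # plan B: covered only if the type is on the list
            plan_b += 1
    return plan_a / n_problems, plan_b / n_problems

share_a, share_b = simulate()
print(f"plan A covers {share_a:.3f} of problems, plan B covers {share_b:.3f}")
```

Under these assumptions both shares settle near .30, so the plans differ in how the coverage feels rather than in how much they pay out over the long run.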


EXAMPLE 2 Vaccination Questionnaires


To test the idea that the pseudocertainty effect influences decision making, Slovic and 
colleagues (1982) administered two different forms of a “vaccination questionnaire.” 
The first form described what the authors called “probabilistic protection,” in which a 
vaccine was available for a disease anticipated to afflict 20% of the population. How- 
ever, the vaccine would protect people with only 50% probability. Respondents were 
asked if they would volunteer to receive the vaccine, and 40% indicated that they 
would. 

The second form described a situation of “pseudocertainty,” in which there were 
two equally likely strains of the disease, each anticipated to afflict 10% of the popula- 
tion. The available vaccine was completely effective against one strain but provided no 
protection at all against the other strain. This time, 57% of respondents indicated they 
would volunteer to receive the vaccine. 

In both cases, receiving the vaccine would reduce the risk of disease from 20% to 
10%. However, the scenario in which there was complete elimination of risk for a sub- 
set of problems was perceived much more favorably than the one for which there was 
the same reduction of risk overall. This is what the pseudocertainty effect predicts. 
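
The claim that both forms reduce the risk from 20% to 10% can be checked with the multiplication and addition rules from Chapter 16. The sketch below assumes, as a simplification, that a person can catch at most one of the two strains, so that the two 10% risks add to 20% without the vaccine; the variable names are ours.

```python
# Form 1: "probabilistic protection"
p_disease = 0.20          # chance of getting the disease without the vaccine
p_vaccine_fails = 0.50    # vaccine protects with only 50% probability
risk_form_1 = p_disease * p_vaccine_fails

# Form 2: "pseudocertainty"
p_strain_1 = 0.10         # strain the vaccine blocks completely
p_strain_2 = 0.10         # strain the vaccine does nothing against
risk_form_2 = p_strain_1 * 0.0 + p_strain_2 * 1.0

print(f"risk after vaccination, form 1: {risk_form_1:.2f}")
print(f"risk after vaccination, form 2: {risk_form_2:.2f}")
```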

Plous (1993, p. 101) notes that a similar effect is found in marketing when items are
given away free rather than having their price reduced. For example, rather than reduce
the price of all items by 50%, a merchandiser may instead advertise that you can “buy
one, get one free.” The overall reduction is the same, but the offer of free merchandise
may be perceived as more desirable than the discount.


17.3 How Personal Probabilities Can Be Distorted 


Which do you think caused more deaths in the United States in 2000, homicide or 
diabetes? If you are like respondents to studies reported by Slovic and colleagues 
(1982, p. 467), you answered that it was homicide. The actual death rates were 6.0 
per 100,000 for homicide compared with 24.6 per 100,000 for diabetes (National 
Center for Health Statistics). 

The distorted view that homicide is more probable results from the fact that 
homicide receives more attention in the media. Psychologists attribute this incorrect 
perception to the availability heuristic. It is just one example of how perceptions of 
risk can be influenced by reference points. 


The Availability Heuristic 


Tversky and Kahneman (1982a, p. 11) note that “there are situations in which peo- 
ple assess the . . . probability of an event by the ease with which instances or occur- 
rences can be brought to mind. . . . This judgmental heuristic is called availability.” 
In the study summarized by Slovic and colleagues (1982), media attention to homi- 
cides severely distorted judgments about their relative frequency. Slovic and col- 
leagues noted that 


Homicides were incorrectly judged more frequent than diabetes and stomach 
cancer deaths. Homicides were also judged to be about as frequent as death by 
stroke, although the latter actually claims about 11 times as many deaths. 
(1982, p. 467) 


Availability can cloud your judgment in numerous instances. For example, if you 
are buying a used car, you may be influenced more by the bad luck of a friend or rel- 
ative who owned a particular model than by statistics provided by consumer groups 
based on the experience of thousands of owners. The memory of the one bad car in 
that class is readily available to you. 

Similarly, most people know many smokers who don’t have lung cancer. Fewer 
people know someone who has actually contracted lung cancer as a result of smok- 
ing. Therefore, it is easier to bring to mind the healthy smokers and, if you smoke, 
to believe that you too will continue to have good health. 


Detailed Imagination 


One way to encourage availability, and thereby distort perceived probabilities of risk,
is to have people vividly imagine an event. Salespeople use this trick when they
try to sell you extended warranties or insurance. For example, they may convince
you that $500 is a reasonable price to pay for an extended warranty on your new car
by having you imagine that if your air conditioner fails it will cost you more than the
price of the policy to get it fixed. They don’t mention that it is extremely unlikely
that your air conditioner will fail during the period of the extended warranty.


Anchoring 


Psychologists have shown that people’s risk perception can also be severely distorted 
when they are provided with a reference point, or an anchor, from which they then 
adjust up or down. Most people tend to stay relatively close to the anchor, or initial 
value, provided. 


EXAMPLE 3 Nuclear War


Plous (1993, pp. 146-147) conducted a survey between January 1985 and May 1987 in 
which he asked respondents to assess the likelihood of a nuclear war between the 
United States and the Soviet Union. He gave three different versions of the survey, which 
he called the low-anchor, high-anchor, and no-anchor conditions. In the low-anchor 
case, he asked people if they thought the chances were higher or lower than 1% and 
then asked them to give their best estimate of the exact chances. In the high-anchor 
case, the 1% figure was replaced by 90% or 99%. In the no-anchor case, they were 
simply asked to give their own assessment. 
According to Plous, 


In all variations of the survey, anchoring exerted a strong influence on likelihood 
estimates of a nuclear war. Respondents who were initially asked whether the 
probability of nuclear war was greater or less than 1 percent subsequently gave 
lower estimates than people who were not provided with an explicit anchor, 
whereas respondents who were first asked whether the probability of war was 
greater or less than 90 (or 99) percent later gave estimates that were higher than 
those given by respondents who were not given an anchor. (1993, p. 147)


Research has shown that anchoring influences real-world decisions as well. For 
example, jurors who are first told about possible harsh verdicts and then about more 
lenient ones are more likely to give a harsh verdict than jurors given the choices in 
reverse order. 


EXAMPLE 4 Sales Price of a House


Plous (1993) describes a study conducted by Northcraft and Neale (1987), in which real 
estate agents were asked to give a recommended selling price for a home. They were 
given a 10-page packet of information about the property and spent 20 minutes walk- 
ing through it. Contained in the 10-page packet of information was a listing price. To 
test the effect of anchoring, four different listing prices, ranging from $119,900 to 
$149,900, were given to different groups of agents. The house had actually been ap- 
praised at $135,000. As the anchoring theory predicts, the agents were heavily influ- 
enced by the particular listing price they were given. The four listing prices and the 
corresponding mean recommended selling prices were 


Apparent listed price            $119,900   $129,900   $139,900   $149,900
Mean recommended sales price     $117,745   $127,836   $128,530   $130,981


As you can see, the recommended sales price differed by more than $10,000 just be- 
cause the agents were anchored at different listing prices. Yet, when asked how they 
made their judgments, very few of the agents mentioned the listing price as one of their 
top factors. 
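
The size of the anchoring effect can be read directly from the reported figures. The sketch below simply does the arithmetic on the four listing prices, the four mean recommendations, and the $135,000 appraisal given above; the variable names are ours.

```python
listing_prices = [119_900, 129_900, 139_900, 149_900]
mean_recommended = [117_745, 127_836, 128_530, 130_981]
appraisal = 135_000

# Gap between the highest and lowest mean recommendation across the four groups.
spread = max(mean_recommended) - min(mean_recommended)
print(f"spread in mean recommended price: ${spread:,}")

# How each group's mean recommendation compares with the appraised value.
for listed, recommended in zip(listing_prices, mean_recommended):
    print(f"listed ${listed:,}: recommended ${recommended:,} "
          f"({recommended - appraisal:+,} relative to appraisal)")
```

The mean recommendation rises with each increase in the listing price, even though the house itself is the same in all four conditions.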

Anchoring is most effective when the anchor is extreme in one direction or the 
other. It does not have to take the form of a numerical assessment either. Be wary 
when someone describes a worst- or best-case scenario and then asks you to make a 
decision. For example, an investment counselor may encourage you to invest in a cer- 
tain commodity by describing how one year it had such incredible growth that you 
would now be rich if you had only been smart enough to invest in the commodity dur- 
ing that year. If you use that year as your anchor, you'll fail to see that, on average, 
the price of this commodity has risen no faster than inflation.


The Representativeness Heuristic 
and the Conjunction Fallacy 


In some cases, the representativeness heuristic leads people to assign higher prob- 
abilities than are warranted to scenarios that are representative of how we imagine 
things would happen. For example, Tversky and Kahneman (1982a, p. 98) note that 
“the hypothesis ‘the defendant left the scene of the crime’ may appear less plausible 
than the hypothesis ‘the defendant left the scene of the crime for fear of being
accused of murder,’ although the latter account is less probable than the former.”

It is the representativeness heuristic that sometimes leads people to fall into the 
judgment trap called the conjunction fallacy. We learned in Chapter 16 (Rule 4) that 
the probability of two events occurring together, in conjunction, cannot be higher 
than the probability of either event occurring alone. The conjunction fallacy occurs 
when detailed scenarios involving the conjunction of events are given higher proba- 
bility assessments than statements of one of the simple events alone. 


EXAMPLE 5 An Active Bank Teller


A classic example, provided by Kahneman and Tversky (1982, p. 496), was a study in 
which they presented subjects with the following statement: 


Linda is 31 years old, single, outspoken, and very bright. She majored in philoso- 
phy. As a student, she was deeply concerned with issues of discrimination and so- 
cial justice, and also participated in antinuclear demonstrations. 


Respondents were then asked which of two statements was more probable: 


1. Linda is a bank teller. 
2. Linda is a bank teller who is active in the feminist movement. 


Kahneman and Tversky report that “in a large sample of statistically naive undergradu- 
ates, 86% judged the second statement to be more probable” (1982, p. 496). 

The problem with that judgment is that the group of people in the world who fit the 
second statement is a subset of the group who fit the first statement. If Linda falls into 
the second group (bank tellers who are active in the feminist movement), she must also 
fall into the first group (bank tellers). Therefore, the first statement must be at least
as probable as the second.

The misjudgment is based on the fact that the second statement is much more
representative of how Linda was described. This example illustrates that intuitive judgments
can directly contradict the known laws of probability. In this example, it was easy for
respondents to fall into the trap of the conjunction fallacy.
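
The subset argument can be checked by counting. The sketch below builds a made-up population in which 5% of people are bank tellers and 40% are active in the feminist movement; both proportions, and the choice to generate the two traits independently, are assumptions for illustration only. Whatever values are used, the people who have both traits are counted among the bank tellers, so their proportion cannot be larger.

```python
import random

rng = random.Random(1)
n_people = 10_000

# Hypothetical population: each person is a (bank_teller, feminist) pair of True/False values.
population = [(rng.random() < 0.05, rng.random() < 0.40) for _ in range(n_people)]

n_tellers = sum(1 for teller, feminist in population if teller)
n_both = sum(1 for teller, feminist in population if teller and feminist)

p_teller = n_tellers / n_people
p_both = n_both / n_people

print(f"P(bank teller)              = {p_teller:.4f}")
print(f"P(bank teller and feminist) = {p_both:.4f}")
```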


The representativeness heuristic can be used to affect your judgment by giving 
you detailed scenarios about how an event is likely to happen. For example, Plous 
(1993, p. 4) asks readers of his book to: 


Place a check mark beside the alternative that seems most likely to occur within 
the next 10 years: 


■ An all-out nuclear war between the United States and Russia


■ An all-out nuclear war between the United States and Russia in which neither
country intends to use nuclear weapons, but both sides are drawn into the conflict
by the actions of a country such as Iraq, Libya, Israel, or Pakistan


Notice that the second alternative describes a subset of the first alternative, and thus 
the first one must be at least as likely as the second. Yet, according to the represen- 
tativeness heuristic, most people would see the second alternative as more likely. 

Be wary when someone describes a scenario to you in great detail in order to try 
to convince you of its likelihood. For example, lawyers know that jurors are much 
more likely to believe a person is guilty if they are provided with a detailed scenario 
of how the person’s guilt could have occurred. 


Forgotten Base Rates 


The representativeness heuristic can lead people to ignore information they may 
have about the likelihood of various outcomes. For example, Kahneman and Tver- 
sky (1973) conducted a study in which they told subjects that a population consisted 
of 30 engineers and 70 lawyers. The subjects were first asked to assess the likelihood 
that a randomly selected individual would be an engineer. The average response was 
indeed close to the correct 30%. 

Subjects were then given the following description, written to give no clues as to 
whether this individual was an engineer or a lawyer. 


Dick is a 30-year-old man. He is married with no children. A man of high abil- 
ity and high motivation, he promises to be quite successful in his field. He is 
well liked by his colleagues. (Kahneman and Tversky, 1973, p. 243) 


This time, the subjects ignored the base rate. When asked to assess the likelihood 
that Dick was an engineer, the median response was 50%. Because the individual in 
question did not appear to represent either group more heavily, the respondents con- 
cluded that there must be an equally likely chance that he was either. They ignored 
the information that only 30% of the population were engineers. 

Neglecting base rates can cloud the probability assessments of experts as well. 
For example, physicians who are confronted with a patient’s positive test results for 
a rare disease routinely overestimate the probability that the patient actually has the 
disease. They fail to take into account the extremely low base rate in the population. 
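
To see how much a base rate matters, here is a minimal sketch that combines a base rate with a piece of evidence using Bayes' rule, which this section does not itself present. The function name and the disease figures (a base rate of 1 in 1000 and a test that is right 90% of the time) are assumptions chosen for illustration, not data from the studies cited above. The only figure taken from the text is the 30% base rate for engineers.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule for a yes/no hypothesis after seeing one piece of evidence."""
    true_and_evidence = prior * p_evidence_if_true
    false_and_evidence = (1 - prior) * p_evidence_if_false
    return true_and_evidence / (true_and_evidence + false_and_evidence)


# Dick's description fits engineers and lawyers equally well, so the two
# likelihoods are equal (0.5 is arbitrary; any equal pair gives the same result).
dick_is_engineer = posterior(prior=0.30, p_evidence_if_true=0.5, p_evidence_if_false=0.5)

# Hypothetical rare disease: base rate 1 in 1000, test correct 90% of the time.
has_disease = posterior(prior=0.001, p_evidence_if_true=0.90, p_evidence_if_false=0.10)

print(f"Probability Dick is an engineer: {dick_is_engineer:.2f}")   # 0.30
print(f"Probability of disease after a positive test: {has_disease:.4f}")  # 0.0089
```

When the evidence does not favor either group, the answer stays at the base rate. When the base rate is very low, even a fairly accurate positive test leaves the probability of disease small under these assumed numbers.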
17.4 Optimism, Reluctance to Change,
and Overconfidence 


EXAMPLE 6 


Psychologists have also found that some people tend to have personal probabilities 
that are unrealistically optimistic. Further, people are often overconfident about how 
likely they are to be right and are reluctant to change their views even when pre- 
sented with contradictory evidence. 


Optimism 


Slovic and colleagues (1982) cite evidence showing that most people view them- 
selves as personally more immune to risks than other people. They note that “the 
great majority of individuals believe themselves to be better than average drivers, 
more likely to live past 80, less likely than average to be harmed by the products they 
use, and so on” (pp. 469-470). 


Optimistic College Students 

Research on college students confirms that they see themselves as more likely than av- 
erage to encounter positive life events and less likely to encounter negative ones. 
Weinstein (1980) asked students at Cook College (part of Rutgers, the State University 
in New Jersey) to rate how likely certain life events were to happen to them compared 
to other Cook students of the same sex. Plous (1993) summarizes Weinstein’s findings: 


On the average, students rated themselves as 15 percent more likely than others 
to experience positive events and 20 percent less likely to experience negative 
events. To take some extreme examples, they rated themselves as 42 percent more 
likely to receive a good starting salary after graduation, 44 percent more likely to 
own their own home, 58 percent less likely to develop a drinking problem, and 38 
percent less likely to have a heart attack before the age of 40. (p. 135) 


Notice that if all the respondents were accurate, the median response for each 
question should have been 0 percent more or less likely because approximately half of 
the students should be more likely and half less likely than average to experience any 
event.


The tendency to underestimate one’s probability of negative life events can lead 
to foolish risk taking. Examples are driving while intoxicated and having unpro- 
tected sex. Plous (1993, p. 134) calls this phenomenon, “It’ll never happen to me,” 
whereas Slovic and colleagues (1982, p. 468) title it, “It won’t happen to me.” The 
point is clear: If everyone underestimates his or her own personal risk of injury, 
someone has to be wrong . . . it will happen to someone. 


Reluctance to Change 


In addition to optimism, most people are also guilty of conservatism. As Plous
(1993) notes, “Conservatism is the tendency to change previous probability estimates
more slowly than warranted by new data” (p. 138). This explains the reluctance of the
scientific community to accept new paradigms or to examine compelling evidence for
phenomena such as extrasensory perception. As noted by Hayward (1984):


There seems to be a strong need on the part of conventional science to exclude 
such phenomena from consideration as legitimate observation. Kuhn and Fey- 
erabend showed that it is always the case with “normal” or conventional sci- 
ence that observations not confirming the current belief system are ignored or 
dismissed. The colleagues of Galileo who refused to look through his telescope 
because they “knew” what the moon looked like are an example. (pp. 78-79) 


This reluctance to change one’s personal-probability assessment or belief based 
on new evidence is not restricted to scientists. It is notable there only because sci- 
ence is supposed to be “objective.” 


Overconfidence 


Consistent with the reluctance to change personal probabilities in the face of new 
data is the tendency for people to place too much confidence in their own assess- 
ments. In other words, when people venture a guess about something for which they 
are uncertain, they tend to overestimate the probability that they are correct. 


EXAMPLE 7 How Accurate Are You?


Fischhoff, Slovic, and Lichtenstein (1977) conducted a study to see how accurate as- 
sessments were when people were sure they were correct. They asked people to answer 
hundreds of questions on general knowledge, such as whether Time or Playboy had a 
larger circulation or whether absinthe is a liqueur or a precious stone. They also asked 
people to rate the odds that they were correct, from 1:1 (50% probability) to 
1,000,000:1 (virtually certain). 

The researchers found that the more confident the respondents were, the more the 
true proportion of correct answers deviated from the odds given by the respondents. For 
example, of those questions for which the respondents gave even odds of being correct 
(50% probability), 53% of the answers were correct. However, of those questions for 
which they gave odds of 100:1 (99% probability) of being correct, only 80% of the
responses were actually correct.
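
The comparison in this example depends on turning stated odds into probabilities. The following sketch shows that conversion and the gap between stated confidence and the proportion actually correct. The function name is an assumption made for illustration. The odds and the observed percentages are the ones reported above.

```python
def odds_to_probability(odds_for, odds_against=1):
    """Convert odds of 'odds_for : odds_against' into a probability."""
    return odds_for / (odds_for + odds_against)


# (stated odds, proportion of answers actually correct), as reported in the text
results = [(1, 0.53), (100, 0.80)]

gaps = {}
for odds_for, observed in results:
    stated = odds_to_probability(odds_for)
    gaps[odds_for] = stated - observed
    print(f"odds {odds_for}:1 -> stated {stated:.3f}, observed {observed:.2f}, "
          f"gap {gaps[odds_for]:+.3f}")
```

Odds of 100:1 correspond to a probability of 100/101, or about .99, so the respondents who claimed that level of confidence overstated their accuracy by about 19 percentage points. Those who gave even odds were, if anything, slightly underconfident.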


Researchers have found a way to help eliminate overconfidence. As Plous (1993) 
notes, “The most effective way to improve calibration seems to be very simple: Stop 
to consider reasons why your judgment might be wrong” (p. 228). In a study by Ko- 
riat, Lichtenstein, and Fischhoff (1980), respondents were asked to list reasons to 
support and to oppose their initial judgments. The authors found that when subjects 
were asked to list reasons to oppose their initial judgment, their probabilities became 
extremely well calibrated. In other words, respondents were much better able to 
judge how much confidence they should have in their answers when they considered 
reasons why they might be wrong. 
17.5 Calibrating Personal Probabilities of Experts


CASE STUDY 17.1 


Professionals who need to help others make decisions often use personal probabili- 
ties themselves, and their personal probabilities are sometimes subject to the same 
distortions discussed in this chapter. For example, your doctor may observe your 
symptoms and give you an assessment of the likelihood that you have a certain dis- 
ease—but fail to take into account the baseline rate for the disease. 

Weather forecasters routinely use personal probabilities to deliver their predic- 
tions of tomorrow’s weather. They attach a number to the likelihood that it will rain 
in a certain area, for example. Those numbers are a composite of information about 
what has happened in similar circumstances in the past and the forecaster’s knowl- 
edge of meteorology. 


Using Relative Frequency 
to Check Personal Probabilities 


As consumers, we would like to know how accurate the probabilities delivered by 
physicians, weather forecasters, and similar professionals are likely to be. To discuss 
what we mean by accuracy, we need to revert to the relative-frequency interpretation 
of probability. For example, if we routinely listen to the same professional weather 
forecaster, we could check his or her predictions using the relative-frequency mea- 
sure. Each evening, we could record the forecaster’s probability of rain for the next 
day, and then the next day we could record whether it actually rained. 

For a perfectly calibrated forecaster, of the many times he or she gave a 30% 
chance of rain, it would actually rain 30% of the time. Of the many times the fore- 
caster gave a 90% chance of rain, it would rain 90% of the time, and so on. We can 
assert that personal probabilities are well calibrated if they come close to meeting 
this standard. 
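
The record keeping described above can be sketched in a few lines. The diary below is invented purely for illustration, and the function name is an assumption. A real check would need many more days at each forecast value before the relative frequencies could be trusted.

```python
from collections import defaultdict


def calibration_table(records):
    """Map each forecast probability to the observed relative frequency of rain.

    records: iterable of (forecast_probability, it_rained) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # forecast -> [rainy days, total days]
    for forecast, rained in records:
        counts[forecast][0] += int(rained)
        counts[forecast][1] += 1
    return {f: rainy / total for f, (rainy, total) in sorted(counts.items())}


# Hypothetical diary: ten days with a 30% forecast, ten days with a 90% forecast.
diary = ([(0.3, True)] * 3 + [(0.3, False)] * 7
         + [(0.9, True)] * 9 + [(0.9, False)] * 1)

table = calibration_table(diary)
for forecast, observed in table.items():
    print(f"forecast {forecast:.0%}: rained {observed:.0%} of the time")
```

In this made-up diary the observed frequencies match the forecasts exactly, which is what perfect calibration looks like. With real records, the question is how close the two columns come.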

Notice that we can assess whether probabilities are well calibrated only if we 
have enough repetitions of the event to apply the relative-frequency definition. For 
instance, we will never be able to ascertain whether the late Carl Sagan was well cal- 
ibrated when he made the assessment we saw in Section 16.3 that “the probability 
that the Earth will be hit by a civilization-threatening small world in the next century 
is a little less than one in a thousand.” This event is obviously not one that will be 
repeated numerous times. 


Calibrating Weather Forecasters and Physicians 


Studies have been conducted of how well calibrated various professionals are. Fig- 
ure 17.1 displays the results of two such studies, one for weather forecasters and one 
for physicians. The open circles indicate actual relative frequencies of rain, plotted 
against various forecast probabilities. The dark circles indicate the relative frequency 
with which a patient actually had pneumonia versus his or her physician’s personal 
probability that the patient had it. 


Figure 17.1 Calibrating weather forecasters and physicians

Source: Plous, 1993, p. 223, using data from Murphy and Winkler (1984) for the weather
forecasters and Christensen-Szalanski and Bushyhead (1981) for the physicians.
The plot indicates that the weather forecasters were generally quite accurate but
that, at least for the data presented here, the physicians were not. The weather fore- 
casters were slightly off at the very high end, when they predicted rain with almost 
certainty. For example, of the times they were sure it was going to rain, and gave a 
probability of 1 (or 100%), it rained only about 91% of the time. Still, the weather 
forecasters were well calibrated enough that you could use their assessments to make 
reliable decisions about how to plan tomorrow’s events. 

The physicians were not at all well calibrated. The actual probability of pneumo- 
nia rose only slightly and remained under 15% even when physicians placed it almost 
as high as 90%. As we will see in an example in Section 18.4, physicians tend to over- 
estimate the probability of disease, especially when the baseline risk is low. When your 
physician quotes a probability to you, you should ask if it is a personal probability or 
one based on data from many individuals in your circumstances. 


17.6 Tips for Improving Your Personal
Probabilities and Judgments 


The research summarized in this chapter suggests methods for improving your own 
decision making when uncertainty and risks are involved. Here are some tips to con- 
sider when making judgments: 
1. Think of the big picture, including risks and rewards that are not presented to you.
For example, when comparing insurance policies, be sure to compare coverage as 
well as cost. 


2. When considering how a decision changes your risk, try to find out what the base- 
line risk is to begin with. Try to determine risks on an equal scale, such as the 
drop in number of deaths per 100,000 people rather than the percent drop in death 
rate. 


3. Don’t be fooled by highly detailed scenarios. Remember that excess detail actu- 
ally decreases the probability that something is true, yet the representativeness 
heuristic leads people to increase their personal probability that it is true. 


4. Remember to list reasons why your judgment might be wrong, to provide a more 
realistic confidence assessment. 


5. Do not fall into the trap of thinking that bad things only happen to other people. 
Try to be realistic in assessing your own individual risks, and make decisions 
accordingly. 

6. Be aware that the techniques discussed in this chapter are often used in market- 
ing. For example, watch out for the anchoring effect when someone tries to an- 
chor your personal assessment to an unrealistically high or low value. 


7. If possible, break events into pieces and try to assess probabilities using the in- 
formation in Chapter 16 and in publicly available information. For example, 
Slovic and colleagues (1982, p. 480) note that because the risk of a disabling injury
on any particular auto trip is only about 1 in 100,000, the need to wear a seat
belt on a specific trip would seem to be small. However, using the techniques de- 
scribed in Chapter 16, they calculated that over a lifetime of riding in an auto- 
mobile the risk is about .33. It thus becomes much more reasonable to wear a seat 
belt at all times. 
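
The seat belt calculation in tip 7 can be sketched as follows. The per-trip risk of 1 in 100,000 comes from the text. The figure of 40,000 trips in a lifetime is an assumption used here for illustration, since the text does not say how many trips were assumed, and the sketch also assumes that trips are independent of one another.

```python
def risk_over_many_trips(per_trip_risk, trips):
    """Chance of at least one injury over many trips, treating trips as independent."""
    chance_no_injury_on_any_trip = (1 - per_trip_risk) ** trips
    return 1 - chance_no_injury_on_any_trip


per_trip_risk = 1 / 100_000   # from the text
lifetime_trips = 40_000       # assumed number of trips, not given in the text

lifetime_risk = risk_over_many_trips(per_trip_risk, lifetime_trips)
print(round(lifetime_risk, 2))   # 0.33
```

The design choice is to work with the chance of no injury on every trip and subtract it from 1, which avoids having to add up the many ways an injury could occur.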


Exercises

Asterisked (*) exercises are included in the Solutions at the back of the book.


1. Explain how the pseudocertainty effect differs from the certainty effect. 


*2. Suppose a television advertisement were to show viewers a product and then 
say, “You might expect to pay $25, $30, or even more for this product. But we 
are offering it for only $16.99.” Explain which of the ideas in this chapter is be- 
ing used to try to exploit viewers. 


3. There are many more termites in the world than there are mosquitoes, but most 
of the termites live in tropical forests. Using the ideas in this chapter, explain 
why most people would think there were more mosquitoes in the world than 
termites. 


4. Suppose a defense attorney is trying to convince the jury that his client’s wallet, 
found at the scene of the crime, was actually planted there by his client’s gar- 
dener. Here are two possible ways he might present this to the jury: 


Statement A: The gardener dropped the wallet when no one was looking.


Statement B: The gardener hid the wallet in his sock and when no one was
looking he quickly reached down and lowered his sock, allowing the wallet
to drop to the ground.


a. Explain why the second statement cannot have a higher probability of being
true than the first statement.


b. Based on the material in this chapter, to which statement are members of the
jury likely to assign higher personal probabilities? Explain.


5. Explain why you should be cautious when someone tries to convince you of
something by presenting a detailed scenario. Give an example.


6. A telephone solicitor recently contacted the author to ask for money for a charity
in which typical contributions are in the range of $25 to $50. The solicitor
said, “We are asking for as much as you can give, up to $300.00.” Do you think
the amount people give would be different if the solicitor said, “We typically get
$25 to $50, but give as much as you can”? Explain, using the relevant material
from this chapter.


7. In this chapter, we learned that one way to lower personal-probability assessments
that are too high is to list reasons why you might be wrong. Explain how
the availability heuristic might account for this phenomenon.


8. Research by Slovic and colleagues (1982) found that people judged that accidents
and diseases cause about the same number of deaths in the United States,
whereas in truth diseases cause about 16 times as many deaths as accidents. Using
the material from this chapter, explain why the researchers found this
misperception.


*9. Determine which statement (A or B) has a higher probability of being true and
explain your answer. Using the material in this chapter, also explain which statement
you think a statistically naive person would think had a higher probability.


A. A car traveling 120 miles per hour on a two-lane highway will have an
accident.


B. A car traveling 120 miles per hour on a two-lane highway will have a
major accident in which all occupants are instantly killed.


10. Explain how an insurance salesperson might try to use each of the following
concepts to sell you insurance:

a. Anchoring

b. Pseudocertainty

c. Availability


11. In the early 1990s, there were approximately 5 billion people in the world.
Plous (1993, p. 5) asked readers to estimate how wide a cube-shaped tank would
have to be to hold all of the human blood in the world. The correct answer is
about 870 feet, but most people give much higher answers. Explain which of the
concepts covered in this chapter leads people to give higher answers.
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12. 


13. 


*14, 


15. 


*16. 


17. 


Barnett (1990) examined front page stories in the New York Times for 1 year, be- 
ginning with October 1, 1988, and found 4 stories related to automobile deaths 
but 51 related to deaths from flying on a commercial jet. These correspond to 
0.08 story per thousand U.S. deaths by automobile and 138.2 stories per thou- 
sand U.S. deaths for commercial jets. He also reported a mid-August 1989 
Gallup Poll finding that 63% of Americans had lost confidence in recent years 
in the safety of airlines. Discuss this finding in the context of the material in this 
chapter. 


Explain how the concepts in this chapter account for each of the following 
scenarios: 


a. Most people rate death by shark attacks to be much more likely than death 
by falling airplane parts, yet the chances of dying from the latter are actually 
30 times greater (Plous, 1993, p. 121). 


b. You are a juror on a case involving an accusation that a student cheated on 
an exam. The jury is asked to assess the likelihood of the statement, “Even 
though he knew it was wrong, the student copied from the person sitting next 
to him because he desperately wants to get into medical school.” The other 
jurors give the statement a high probability assessment although they know 
nothing about the accused student. 


c. Research by Tversky and Kahneman (1982b) has shown that people think 
that words beginning with the letter k are more likely to appear in a typical 
segment of text in English than words with k as the third letter. In fact, there 
are about twice as many words with k as the third letter than words that be- 
gin with k. 

d. A 45-year-old man dies of a heart attack and does not leave a will. 


Suppose you go to your doctor for a routine examination, without any com- 
plaints of problems. A blood test reveals that you have tested positive for a cer- 
tain disease. Based on the ideas in this chapter, what should you ask your doctor 
in order to assess how worried you should be? 


Give one example of how each of the following concepts has had or might have 
an unwanted effect on a decision or action in your daily life: 

a. Conservatism 

b. Optimism 

c. Forgotten base rates 

d. Availability 

Explain which of the concepts in this chapter might contribute to the decision to 
buy a lottery ticket. 


Suppose you have a friend who is willing to ask her friends a few questions and 
then, based on their answers, is willing to assess the probability that those 
friends will get an A in each of their classes. She always assesses the probabil- 
ity to be either .10 or .90. She has made hundreds of these assessments and has 
kept track of whether her friends actually received A’s. How would you deter- 
mine if she is well calibrated? 


18. 
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Guess at the probability that if you ask five people when their birthdays are, you 
will find someone born in the same month as you. For simplicity, assume that 
the probability that a randomly selected person will have the same birth month 
you have is 1/12. Now use the material from Chapter 16 to make a table listing 
the numbers from | to 5 and then fill in the probabilities that you will first en- 
counter someone with your birth month by asking that many people. Determine 
the accumulated probability that you will have found someone with your birth 
month by the time you ask the fifth person. How well calibrated was your ini- 
tial guess? 


Mini-Projects 


1. Design and conduct an experiment to try to elicit misjudgments based on one of
the phenomena described in this chapter. Explain exactly what you did and your
results.


2. Find and explain an example of a marketing strategy that uses one of the techniques
in this chapter to try to increase the chances that someone will purchase
something. Do not use an exact example from the chapter, such as “buy one, get
one free.”


3. Find a journal article that describes an experiment designed to test the kinds of
biases described in this chapter. Summarize the article, and discuss what conclusions
can be made from the research. You can find such articles by searching
appropriate bibliographic databases and trying key words from this chapter.


4. Estimate the probability of some event in your life using a personal probability,
such as the probability that a person who passes you on the street will be wearing
a hat. Use an event for which you can keep a record of the relative frequency
of occurrence over the next week. How well calibrated were you?


References 


Barnett, A. (1990). Air safety: End of the golden age? Chance 3, no. 2, pp. 8-12.


Christensen-Szalanski, J. J. J., and J. B. Bushyhead. (1981). Physicians’ use of probabilistic
information in a real clinical setting. Journal of Experimental Psychology: Human Perception
and Performance 7, pp. 928-935.


Fischhoff, B., P. Slovic, and S. Lichtenstein. (1977). Knowing with certainty: The appropriateness
of extreme confidence. Journal of Experimental Psychology: Human Perception
and Performance 3, pp. 552-564.


Hayward, J. W. (1984). Perceiving ordinary magic: Science and intuitive wisdom. Boulder,
CO: New Science Library.


Kahneman, D., and A. Tversky. (1973). On the psychology of prediction. Psychological Review
80, pp. 237-251.


Kahneman, D., and A. Tversky. (1979). Prospect theory: An analysis of decision under risk. 
Econometrica 47, pp. 263-291. 


Kahneman, D., and A. Tversky. (1982). On the study of statistical intuitions. In D. Kahneman, 
P. Slovic, and A. Tversky (eds.), Judgment under uncertainty: Heuristics and biases 
(Chapter 34). Cambridge, England: Cambridge University Press. 


Koriat, A., S. Lichtenstein, and B. Fischhoff. (1980). Reasons for confidence. Journal of Ex- 
perimental Psychology: Human Learning and Memory 6, pp. 107-118. 


Murphy, A. H., and R. L. Winkler. (1984). Probability forecasting in meteorology. Journal of 
the American Statistical Association 79, pp. 489-500. 


Northcraft, G. B., and M. A. Neale. (1987). Experts, amateurs, and real estate: An anchoring 
and adjustment perspective on property pricing decisions. Organizational Behavior and 
Human Decision Processes 39, pp. 84-97. 


Plous, S. (1993). The psychology of judgment and decision making. New York: McGraw-Hill. 


Slovic, P., B. Fischhoff, and S. Lichtenstein. (1982). Facts versus fears: Understanding per- 
ceived risk. In D. Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under uncer- 
tainty: Heuristics and biases (Chapter 33). Cambridge, England: Cambridge University 
Press. 


Tversky, A., and D. Kahneman. (1982a). Judgment under uncertainty: Heuristics and biases.
In D. Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under uncertainty: Heuristics 
and biases (Chapter 1). Cambridge, England: Cambridge University Press. 

Tversky, A., and D. Kahneman. (1982b). Availability: A heuristic for judging frequency and 
probability. In D. Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under uncertainty:
Heuristics and biases (Chapter 11). Cambridge, England: Cambridge University
Press. 


Weinstein, N. D. (1980). Unrealistic optimism about future life events. Journal of Personality 
and Social Psychology 39, pp. 806-820. 


World almanac and book of facts. (1995). Edited by Robert Famighetti. Mahwah, NJ: Funk 
and Wagnalls. 


When Intuition Differs 
from Relative Frequency 


Thought Questions 


1. Do you think it likely that anyone will ever win a state lottery twice in a lifetime? 


2. How many people do you think would need to be in a group in order to be at least 
50% certain that two of them will have the same birthday? 


3. Suppose you test positive for a rare disease, and your original chances of having the 
disease are no higher than anyone else’s—say, close to 1 in 1000. You are told that 
the test has a 10% false positive rate and a 10% false negative rate. In other words, 
whether you have the disease or not, the test is 90% likely to give a correct answer. 
Given that you tested positive, what do you think is the probability that you actually 
have the disease? Do you think the chances are higher or lower than 50%? 


4. If you were to flip a fair coin six times, which sequence do you think would be most 
likely: HHHHHH or HHTHTH or HHHTTT? 


5. If you were faced with the following sets of alternatives, which one would you 
choose in each set? (Choose either A or B and either C or D.) Explain your answer. 


A. A gift of $240, guaranteed 

B. A 25% chance to win $1000 and a 75% chance of getting nothing 
C. A sure loss of $740 

D. A 75% chance to lose $1000 and a 25% chance to lose nothing 
18.1 Revisiting Relative Frequency


Recall that the relative-frequency interpretation of probability provides a precise an- 
swer to certain probability questions. As long as we agree on the physical assump- 
tions underlying an uncertain process, we should also agree on the probabilities of 
various outcomes. For example, if we agree that lottery numbers are drawn fairly, 
then we should agree on the probability that a particular ticket will be a winner. 

In many instances, the physical situation lends itself to computing a relative- 
frequency probability, but people ignore that information. In this chapter, we exam- 
ine how probability assessments that should be objective are instead confused by 
incorrect thinking. 


18.2 Coincidences 


EXAMPLE 1 


When I was in college in upstate New York, I visited Disney World in Florida dur- 
ing summer break. While there, I ran into three people I knew from college, none of 
whom were there together. A few years later, I visited the top of the Empire State 
Building in New York City and ran into two friends (who were there together) and 
two additional unrelated acquaintances. Years later, I was traveling from London to 
Stockholm and ran into a friend at the airport in London while waiting for the flight. 
Not only did the friend turn out to be taking the same flight but we had been assigned 
adjacent seats. 


Are Coincidences Improbable? 


These events are all examples of what would commonly be called coincidences. 
They are certainly surprising, but are they improbable? Most people think that coin- 
cidences have low probabilities of occurring, but we shall see that our intuition can 
be quite misleading regarding such phenomena. 

We will adopt the definition of coincidence proposed by Diaconis and Mosteller: 


A coincidence is a surprising concurrence of events, perceived as meaningfully 
related, with no apparent causal connection. (1989, p. 853) 


The mathematically sophisticated reader may wish to consult the article by Diaconis 
and Mosteller, in which they provide some instructions on how to compute proba- 
bilities for coincidences. For our purposes, we need nothing more sophisticated than 
the simple probability rules we encountered in Chapter 16. 

Here are some examples of coincidences that at first glance seem highly im- 
probable: 


Two George D. Brysons 


“My next-door neighbor, Mr. George D. Bryson, was making a business trip some years
ago from St. Louis to New York. Since this involved weekend travel and he was in no
hurry, . . . and since his train went through Louisville, he asked the conductor, after he
had boarded the train, whether he might have a stopover in Louisville.

“This was possible, and on arrival at Louisville he inquired at the station for the lead- 
ing hotel. He accordingly went to the Brown Hotel and registered. And then, just as a 
lark, he stepped up to the mail desk and asked if there was any mail for him. The girl 
calmly handed him a letter addressed to Mr. George D. Bryson, Room 307, that being 
the number of the room to which he had just been assigned. It turned out that the 
preceding resident of Room 307 was another George D. Bryson” (Weaver, 1963, 
pp. 282-283).


EXAMPLE 2 Identical Cars and Matching Keys

Plous (1993, p. 154) reprinted an Associated Press news story describing a coincidence 
in which a man named Richard Baker and his wife were shopping on April Fool's Day at 
a Wisconsin shopping center. Mr. Baker went out to get their car, a 1978 maroon Con- 
cord, and drove it around to pick up his wife. After driving for a short while, they no- 
ticed items in the car that did not look familiar. They checked the license plate, and sure 
enough, they had someone else's car. When they drove back to the shopping center (to 
find the police waiting for them), they discovered that the owner of the car they were 
driving was a Mr. Thomas Baker, no relation to Richard Baker. Thus, both Mr. Bakers 
were at the same shopping center at the same time, with identical cars and with matching
keys. The police estimated the odds as “a million to one.”


EXAMPLE 3 Winning the Lottery Twice

Moore (1991, p. 278) reported on a New York Times story of February 14, 1986, about 
Evelyn Marie Adams, who won the New Jersey lottery twice in a short time period. Her 
winnings were $3.9 million the first time and $1.5 million the second time. Then, in May 
1988, Robert Humphries won a second Pennsylvania state lottery, bringing his total win- 
nings to $6.8 million. When Ms. Adams won for the second time, the New York Times 
claimed that the odds of one person winning the top prize twice were about 1 in 17 
trillion.


Someone, Somewhere, Someday 


Most people think that the events just described are exceedingly improbable, and 
they are. What is not improbable is that someone, somewhere, someday will experi- 
ence those events or something similar. 

When we examine the probability of what appears to be a startling coincidence, 
we ask the wrong question. For example, the figure quoted by the New York Times 
of 1 in 17 trillion is the probability that a specific individual who plays the New Jer- 
sey state lottery exactly twice will win both times (Diaconis and Mosteller, 1989, 
p. 859). However, millions of people play the lottery every day, and it is not surpris- 
ing that someone, somewhere, someday would win twice. 

In fact, Purdue professors Stephen Samuels and George McCabe (cited in Dia- 
conis and Mosteller, 1989, p. 859) calculated those odds to be practically a sure 
thing. They calculated that there was at least a 1 in 30 chance of a double winner in 
a 4-month period and better than even odds that there would be a double winner in 
a 7-year period somewhere in the United States. Further, they used conservative as- 
sumptions about how many tickets past winners purchase. 
EXAMPLE 4


EXAMPLE 5 


When you experience a coincidence, remember that there are over 6 billion peo- 
ple in the world and over 290 million in the United States. If something has a 1 in 1 
million probability of occurring to each individual on a given day, it will occur to an 
average of over 290 people in the United States each day and over 6000 people in 
the world each day. Of course, probabilities of specific events depend on individual 
circumstances, but you can see that, quite often, it is not unlikely that something sur- 
prising will happen. 
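
The arithmetic in this paragraph can be checked with a short Python sketch. The function names are ours, introduced only for illustration, and `prob_at_least_one` additionally assumes that the event happens to different people independently, which the text does not claim. The population figures are the round numbers quoted above.

```python
def expected_count(population, p):
    """Average number of people to whom an event of probability p happens."""
    return population * p


def prob_at_least_one(population, p):
    """Chance the event happens to at least one person (assumes independence)."""
    return 1 - (1 - p) ** population


p = 1 / 1_000_000                          # 1 in 1 million, per person per day
print(expected_count(290_000_000, p))      # United States
print(expected_count(6_000_000_000, p))    # world
print(prob_at_least_one(290_000_000, p))   # someone, somewhere in the U.S.
```

Under these assumptions the chance that the event happens to nobody at all is vanishingly small, which is the sense in which "someone, somewhere, someday" is nearly certain.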


Sharing the Same Birthday 


Here is a famous example that you can use to test your intuition about surprising events. 
How many people would need to be gathered together to be at least 50% sure that two 
of them share the same birthday? Most people provide answers that are much higher 
than the correct one, which is only 23 people. 

There are several reasons why people have trouble with this problem. If your answer 
was somewhere close to 183, or half the number of birthdays, then you may have con- 
fused the question with another one, such as the probability that someone in the group 
has your birthday or that two people have a specific date as their birthday. 

It is not difficult to see how to calculate the appropriate probability, using our rules 
from Chapter 16. Notice that the only way to avoid two people having the same birth- 
day is if all 23 people have different birthdays. To find that probability, we simply use 
the rule that applies to the word and (Rule 3), thus multiplying probabilities. 

Hence, the probability that the first three people have different birthdays is the prob- 
ability that the second person does not share a birthday with the first, which is 364/365 
(ignoring February 29), and the third person does not share a birthday with either of the 
first two, which is 363/365. (Two dates were already taken.) 

Continuing this line of reasoning, the probability that none of the 23 people share 
a birthday is 


$$\frac{(364)(363)(362)\cdots(343)}{(365)^{22}} = .493$$


Therefore, the probability that at least two people share a birthday is what's left of the 
probability, or 1 - .493 = .507. 

If you find it difficult to follow the arithmetic line of reasoning, simply consider this. 
Imagine each of the 23 people shaking hands with the remaining 22 people and asking 
them about their birthday. There would be 253 handshakes and birthday conversations. 
Surely there is a relatively high probability that at least one of those pairs would discover 
a common birthday. 

By the way, the probability of a shared birthday in a group of 10 people is already 
better than 1 in 9, at .117. (There would be 45 handshakes.) With only 50 people, it is 
almost certain, with a probability of .97. (There would be 1225 handshakes.)
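
The multiplication described in this example is easy to carry out by computer. The following Python sketch is one way to do it; the function names are ours, and it makes the same assumptions as the example (365 equally likely birthdays, February 29 ignored).

```python
def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes `days` equally likely birthdays and unrelated people.
    """
    prob_all_different = 1.0
    for k in range(n):
        # The (k+1)th person must avoid the k dates already taken.
        prob_all_different *= (days - k) / days
    return 1 - prob_all_different


def handshakes(n):
    """Number of distinct pairs among n people."""
    return n * (n - 1) // 2


for n in (10, 23, 50):
    print(n, round(prob_shared_birthday(n), 3), handshakes(n))
```

For groups of 10, 23, and 50 people this reproduces the probabilities of about .117, .507, and .97 and the 45, 253, and 1225 handshakes quoted above.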


Unusual Hands in Card Games 


As a final example, consider a card game, like bridge, in which a standard 52-card deck 
is dealt to four players, so they each receive 13 cards. Any specific set of 13 cards is 
equally likely, each with a probability of about 1 in 635 billion. You would probably not 
be surprised to get a mixed hand—say, the 4, 7, and 10 of hearts; 3, 8, 9, and jack of 
spades; 2 and queen of diamonds; and 6, 10, jack, and ace of clubs. Yet, that specific 
hand is just as unlikely as getting all 13 hearts. The point is that any very specific event, 
surprising or not, has extremely low probability; however, there are many such events, 
and their combined probability is quite high. 
Magicians sometimes exploit the fact that many small probabilities add up to one 
large probability by doing a trick in which they don’t tell you what to expect in advance. 
They set it up so that something surprising is almost sure to happen. When it does, you 
are likely to focus on the probability of that particular outcome, rather than realizing that 
a multitude of other outcomes would have also been surprising and that one of them 
was likely to happen.
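
The figure of about 1 in 635 billion is the reciprocal of the number of ways to choose 13 cards from 52. A minimal Python check, using the standard library function `math.comb` (the variable name `hands` is ours):

```python
from math import comb

hands = comb(52, 13)   # number of distinct 13-card hands from a 52-card deck
print(hands)           # roughly 635 billion
print(1 / hands)       # probability of being dealt any one specific hand
```

Every specific hand, whether it looks remarkable or not, has this same probability.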


Most Coincidences Only Seem Improbable 


To summarize, most coincidences seem improbable only if we ask for the probabil- 
ity of that specific event occurring at that time to us. If, instead, we ask the proba- 
bility of it occurring some time, to someone, the probability can become quite large. 
Further, because of the multitude of experiences we each have every day, it is not 
surprising that some of them may appear to be improbable. That specific event is un- 
doubtedly improbable. What is not improbable is that something “unusual” will hap- 
pen to each of us once in a while. 


18.3 The Gambler’s Fallacy 


Another common misperception about random events is that they should be self- 
correcting. Another way to state this is that people think the long-run frequency of 
an event should apply even in the short run. This misperception has classically been 
called the gambler’s fallacy. 

A related misconception is what Tversky and Kahneman (1982) call the belief 
in the law of small numbers, “according to which [people believe that] even small 
samples are highly representative of the populations from which they are drawn” 
(p. 7). They report that “in considering tosses of a coin for heads and tails, for ex- 
ample, people regard the sequence HTHTTH to be more likely than the sequence 
HHHTTT, which does not appear random, and also more likely than the sequence 
HHHHTH, which does not represent the fairness of the coin” (p. 7). Remember that 
any specific sequence of heads and tails is just as likely as any other if the coin is 
fair, so the idea that the first sequence is more likely is a misperception. 


Independent Chance Events Have No Memory 


The gambler’s fallacy can lead to poor decision making, especially if applied to gam- 
bling. For example, people tend to believe that a string of good luck will follow a 
string of bad luck in a casino. Unfortunately, independent chance events have no 
such memory. Making 10 bad gambles in a row doesn’t change the probability that 
the next gamble will also be bad. 
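
One way to convince yourself of this is to list every possible sequence of outcomes and look only at those that begin with a losing streak. The Python sketch below does so for a gamble that is won or lost with equal probability; the 50-50 gamble and the function name are our assumptions, chosen so that all sequences are equally likely and can simply be counted.

```python
from itertools import product


def prob_win_after_losses(k):
    """P(win on gamble k+1 | first k gambles were losses) for a 50-50 gamble.

    Enumerates all equally likely sequences of k+1 wins (W) and losses (L).
    """
    sequences = list(product("WL", repeat=k + 1))
    after_streak = [s for s in sequences if all(x == "L" for x in s[:k])]
    wins = [s for s in after_streak if s[k] == "W"]
    return len(wins) / len(after_streak)


for k in (0, 1, 5, 10):
    print(k, prob_win_after_losses(k))
```

However long the losing streak, exactly half of the sequences that start with it end in a win, so the streak tells you nothing about the next gamble.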


When the Gambler’s Fallacy May Not Apply 


Notice that the gambler’s fallacy applies to independent events. Recall from Chap- 
ter 16 that independent events are those for which occurrence on one occasion gives 
no information about occurrence on the next occasion, as with successive flips of a 
coin. The gambler’s fallacy may not apply to situations where knowledge of one out- 
come affects probabilities of the next. For instance, in card games using a single 
deck, knowledge of what cards have already been played provides information about 
what cards are likely to be played next. If you normally receive lots of mail but have 
received none for two days, you would probably (correctly) assess that you are likely 
to receive more than usual the next day. 


18.4 Confusion of the Inverse 


Consider the following scenario, discussed by Eddy (1982). You are a physician. 
One of your patients has a lump in her breast. You are almost certain that it is be- 
nign; in fact, you would say there is only a 1% chance that it is malignant. But just 
to be sure, you have the patient undergo a mammogram, a breast X ray designed to 
detect cancer. 

You know from the medical literature that mammograms are 80% accurate for 
malignant lumps and 90% accurate for benign lumps. In other words, if the lump is 
truly malignant, the test results will say that it is malignant 80% of the time and will 
falsely say it is benign 20% of the time. If the lump is truly benign, the test results 
will say so 90% of the time and will falsely declare that it is malignant only 10% of 
the time. 

Sadly, the mammogram for your patient is returned with the news that the lump 
is malignant. What are the chances that it is truly malignant? 

Eddy posed this question to 100 physicians. Most of them thought the probabil- 
ity that the lump was truly malignant was about 75% or .75. In truth, given the prob- 
abilities as described, the probability is only .075. The physicians’ estimates were 10 
times too high! 

When he asked them how they arrived at their answers, Eddy realized that the 
physicians were confusing the actual question with a different question: “When 
asked about this, the erring physicians usually report that they assumed that the prob- 
ability of cancer given that the patient has a positive X ray was approximately equal 
to the probability of a positive X ray in a patient with cancer” (1982, p. 254). 


Robyn Dawes has called this phenomenon confusion of the inverse (Plous, 
1993, p. 132). The physicians were confusing the probability of cancer given a pos- 
itive X ray with its inverse, the probability of a positive X ray, given that the patient 
has cancer. 


Determining the Actual Probability 


It is not difficult to see that the correct answer to the question posed to the physicians 
by Eddy (in the previous section) is indeed .075. Let’s construct a hypothetical table 
of 100,000 women who fit this scenario. In other words, these are women who would 
present themselves to the physician with a lump for which the probability that it was 
malignant seemed to be about 1%. Thus, of the 100,000 women, about 1%, or 1000 
of them, would have a malignant lump. The remaining 99%, or 99,000, would have 
a benign lump. 


Table 18.1 Breakdown of Actual Status versus Test Status 
for a Rare Disease 

                      Test Shows    Test Shows
                      Malignant     Benign          Total
Actually malignant          800         200         1,000
Actually benign           9,900      89,100        99,000
Total                    10,700      89,300       100,000

Further, given that the test was 80% accurate for malignant lumps, it would show 
a malignancy for 800 of the 1000 women who actually had one. Given that it was 
90% accurate for the 99,000 women with benign lumps, it would show benign for 
90%, or 89,100 of them and malignant for the remaining 10%, or 9900 of them. 
Table 18.1 shows how the 100,000 women would fall into these possible categories. 

Let’s return to the question of interest. Our patient has just received a positive 
test for malignancy. Given that her test showed malignancy, what is the actual prob- 
ability that her lump is malignant? Of the 100,000 women, 10,700 of them would 
have an X ray show malignancy. But of those 10,700 women, only 800 of them ac- 
tually have a malignant lump. Thus, given that the test showed a malignancy, the 
probability of malignancy is just 800/10,700 = 8/107 = .075. 
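
The same hypothetical table can be built with a few lines of Python. This is only a sketch: the function and parameter names are ours, with `sensitivity` standing for the test's accuracy on malignant lumps (80%) and `specificity` for its accuracy on benign lumps (90%).

```python
def build_table(total, base_rate, sensitivity, specificity):
    """Hypothetical counts of actual status versus test status."""
    malignant = round(total * base_rate)
    benign = total - malignant
    true_pos = round(malignant * sensitivity)   # malignant, test says malignant
    false_neg = malignant - true_pos            # malignant, test says benign
    true_neg = round(benign * specificity)      # benign, test says benign
    false_pos = benign - true_neg               # benign, test says malignant
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "false_pos": false_pos,
        "true_neg": true_neg,
    }


t = build_table(100_000, base_rate=0.01, sensitivity=0.80, specificity=0.90)
positives = t["true_pos"] + t["false_pos"]      # all women whose test shows malignant
print(t)
print(round(t["true_pos"] / positives, 3))      # probability of malignancy given a positive test
```

Changing `base_rate` shows how strongly the answer depends on how rare the condition is among the women tested.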


The Probability of False Positives 


Many physicians are guilty of confusion of the inverse. Remember, in a situation 
where the base rate for a disease is very low and the test for the disease is less than 
perfect, there will be a relatively high probability that a positive test result is a false 
positive. 

If you ever find yourself in a situation similar to the one just described, you may 
wish to construct a table like Table 18.1. 


To determine the probability of a positive test result being accurate, you need 
only three pieces of information: 


1. The base rate or probability that you are likely to have the disease, 
without any knowledge of your test results 

2. The sensitivity of the test, which is the proportion of people who 
correctly test positive when they actually have the disease 

3. The specificity of the test, which is the proportion of people who 
correctly test negative when they don’t have the disease 
Notice that items 2 and 3 are measures of the accuracy of the test. They do not 
measure the probability that someone has the disease when they test positive or the 
probability that they do not have the disease when they test negative. Those proba- 
bilities, which are obviously the ones of interest to the patient, can be computed by 
constructing a table similar to Table 18.1. They can also be computed by using a for- 
mula called Bayes’ Rule, given in the For Those Who Like Formulas section at the 
end of this chapter. 


Streak Shooting in Basketball: Reality or Illusion? 
Source: Tversky and Gilovich (Winter 1989). 


We have learned in this chapter that people’s intuition, when it comes to assessing 
probabilities, is not very good, particularly when their wishes for certain outcomes 
are motivated by outside factors. Tversky and Gilovich (Winter 1989) decided to 
compare basketball fans’ impressions of “streak shooting” with the reality evidenced 
by the records. 

First, they generated phony sequences of 21 alleged “hits and misses” in shoot- 
ing baskets and showed them to 100 knowledgeable basketball fans. Without telling 
them the sequences were faked, they asked the fans to classify each sequence as 
“chance shooting,” in which the probability of a hit on each shot was unrelated to 
previous shots; “streak shooting,” in which the runs of hits and misses were longer 
than would be expected by chance; or “alternating shooting,” in which runs of hits 
and misses were shorter than would be expected by chance. They found that people 
tended to think that streaks had occurred when they had not. Indeed, 65% of the 
respondents thought the sequence that had been generated by “chance shooting” was 
actually “streak shooting.” 

To give you some idea of the task involved, decide which of the following two 
sequences of 10 successes (S) and 11 failures (F) you think is more likely to be the 
result of “chance shooting”: 


Sequence 1: FFSSSFSFFFSSSSFSFFFSF 
Sequence 2: FSFFSFSFFFSFSSFSFSSSF 


Notice that each sequence represents 21 shots. In “chance shooting,” the propor- 
tion of throws on which the result is different from the previous throw should be 
about one-half. If you thought sequence 1 was more likely to be due to chance shoot- 
ing, you’re right. Of the 20 throws that have a preceding throw, exactly 10 are dif- 
ferent. In sequence 2, 14 of 20, or 70%, of the shots differ from the previous shot. If 
you selected sequence 2, you are like the fans tested by Tversky and Gilovich. The 
sequences with 70% and 80% alternating shots were most likely to be selected (er- 
roneously) as being the result of “chance shooting.” 
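
The counts quoted for the two sequences can be verified with a short Python sketch. The function name is ours; it simply compares each shot with the one before it.

```python
def alternation_rate(sequence):
    """Return (number, proportion) of shots that differ from the previous shot."""
    changes = sum(1 for prev, cur in zip(sequence, sequence[1:]) if prev != cur)
    return changes, changes / (len(sequence) - 1)


seq1 = "FFSSSFSFFFSSSSFSFFFSF"
seq2 = "FSFFSFSFFFSFSSFSFSSSF"
print(alternation_rate(seq1))
print(alternation_rate(seq2))
```

Sequence 1 changes on 10 of its 20 transitions and sequence 2 on 14 of 20, matching the proportions of one-half and 70% described above.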

To further test the idea that basketball fans (and players) see patterns in shooting 
success and failure, Tversky and Gilovich asked questions about the probability of 
successful hitting after hitting versus after missing. For example, they asked the fol- 
lowing question of 100 basketball fans: 
When shooting free throws, does a player have a better chance of making his 
second shot after making his first shot than after missing his first shot? (1989, 
p. 20) 


Sixty-eight percent of the respondents said yes; 32% said no. They asked mem- 
bers of the Philadelphia 76ers basketball team the same question, with similar re- 
sults. A similar question about ordinary shots elicited even stronger belief in streaks, 
with 91% responding that the probability of making a shot was higher after having 
just made the last two or three shots than after having missed them. 

What about the data on shooting? The researchers examined data from several 
NBA teams, including the Philadelphia 76ers, the New Jersey Nets, the New York 
Knicks, and the Boston Celtics. In this case study, we examine the data they re- 
ported for free throws. These are throws in which action stops and the player 
stands in a fixed position, usually for two successive attempts to put the ball in the 
basket. Examining free-throw shots removes the possible confounding effect that 
members of the other team would more heavily guard a player they perceive as be- 
ing “hot.” 

Tversky and Gilovich reported free-throw data for nine members of the Boston 
Celtics basketball team. They examined the long-run frequency of a hit on the sec- 
ond free throw after a hit on the first one, and after a miss on the first one. Of the 
nine players, five had a higher probability of a hit after a miss, whereas four had a 
higher probability of a hit after a hit. In other words, the perception of 68% of the 
fans that the probability of a hit was higher after a hit than after a miss was not 
supported by the actual data. 

Tversky and Gilovich looked at other sequences of hits and misses from the 
NBA teams, in addition to generating their own data in a controlled experiment us- 
ing players from Cornell University’s varsity basketball teams. They analyzed the 
data in a variety of ways, but they could find no evidence of a “hot hand” or “streak 
shooting.” They conclude: 


Our research does not tell us anything in general about sports, but it does sug- 
gest a generalization about people, namely that they tend to “detect” patterns 
even where none exist, and to overestimate the degree of clustering in sports 
events, as in other sequential data. We attribute the discrepancy between the 
observed basketball statistics and the intuitions of highly interested and in- 
formed observers to a general misconception of the laws of chance that in- 
duces the expectation that random sequences will be far more balanced than 
they generally are, and creates the illusion that there are patterns of streaks in 
independent sequences. (1989, p. 21) 


The research by Tversky and Gilovich has not gone unchallenged. For addi- 
tional reading, see the articles by Hooke (1989) and by Larkey, Smith, and Kadane 
(1989). They argue that just because Tversky and Gilovich did not find evidence of 
“streak shooting” in the data they examined doesn’t mean that it never occurs. 
18.5 Using Expected Values to Make Wise Decisions 


In Chapter 16, we learned how to compute the expected value of numerical out- 
comes when we know the outcomes and their probabilities. Using this information, 
you would think that people would make decisions that allowed them to maximize 
their expected monetary return. But people don’t behave this way. If they did, they 
would not buy lottery tickets or insurance. 

Businesses like insurance companies and casinos rely on the theory of expected 
value to stay in business. Insurance companies know that young people are more 
likely than middle-aged people to have automobile accidents and that older people 
are more likely to die of nonaccidental causes. They determine the prices of auto- 
mobile and life insurance policies accordingly. 

If individuals were solely interested in maximizing their monetary gains, they 
would use expected value in a similar manner. For example, in Chapter 16, Example 
15, we illustrated that for the California Decco lottery game, there was an average 
loss of 35 cents for each ticket purchased. Most lottery players know that there is an 
expected loss for every ticket purchased, yet they continue to play. Why? Probably 
because the excitement of playing and possibly winning has intrinsic, nonmonetary 
value that compensates for the expected monetary loss. 

Social scientists have long been intrigued with how people make decisions, and 
much research has been conducted on the topic. The most popular theory among 
early researchers, in the 1930s and 1940s, was that people made decisions to maxi- 
mize their expected utility. This may or may not correspond to maximizing their ex- 
pected dollar amount. The idea was that people would assign a worth or utility to 
each outcome and choose whatever alternative yielded the highest expected value. 

More recent research has shown that decision making is influenced by a number 
of factors and can be a complicated process. (Plous [1993] presents an excellent 
summary of much of the research on decision making.) The way in which the deci- 
sion is presented can make a big difference. For example, Plous (1993, p. 97) dis- 
cusses experiments in which respondents were presented with scenarios similar to 
the following: 


If you were faced with the following alternatives, which would you choose? Note 
that you can choose either A or B and either C or D. 


A. A gift of $240, guaranteed 

B. A 25% chance to win $1000 and a 75% chance of getting nothing 

C. A sure loss of $740 

D. A 75% chance to lose $1000 and a 25% chance to lose nothing 

When asked to choose between A and B, the majority of people chose the sure 
gain represented by choice A. Notice that the expected value under choice B is $250, 
which is higher than the sure gain of $240 from choice A, yet people prefer choice A. 

When asked to choose between C and D, the majority of people chose the gamble 
rather than the sure loss. Notice that the expected value under choice D is -$750, 
representing a larger expected loss than the $740 presented in choice C. For dollar 
amounts and probabilities of this magnitude, people tend to value a sure gain, but are 
willing to take a risk to prevent a loss. 

The second set of choices (C and D) is similar to the decision people must make 
when deciding whether to buy insurance. The cost of the premium is the sure loss. 
The probabilistic choice represented by alternative D is similar to gambling on 
whether you will have a fire, burglary, accident, and so on. Why then do people 
choose to gamble in the scenario just presented, yet tend to buy insurance? 

As Plous (1993) explains, one factor seems to be the magnitudes of the proba- 
bilities attached to the outcomes. People tend to give small probabilities more weight 
than they deserve for their face value. Losses connected with most insurance poli- 
cies have a low probability of actually occurring, yet people worry about them. Plous 
(1993, p. 99) reports on a study in which people were presented with the following 
two scenarios: 


Alternative A: A 1 in 1000 chance of winning $5000 
Alternative B: A sure gain of $5 

Alternative C: A 1 in 1000 chance of losing $5000 
Alternative D: A sure loss of $5 


About three-fourths of the respondents presented with scenario A and B chose 
the risk presented by alternative A. This is similar to the decision to buy a lottery 
ticket, where the sure gain corresponds to keeping the money rather than using it to 
buy a ticket. 

For scenario C and D, nearly 80% of respondents chose the sure loss (D). This 
is the situation that results in the success of the insurance industry. Of course, the 
dollar amounts are also important. A sure loss of $5 may be easy to absorb, while 
the risk of losing $5000 may be equivalent to the risk of bankruptcy. 
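
The expected values of the alternatives in both sets of scenarios in this section can be checked with the rule from Chapter 16. The Python sketch below is ours; each alternative is written as a list of (dollar amount, probability) pairs, with losses entered as negative amounts.

```python
def expected_value(outcomes):
    """Expected value of a list of (amount, probability) pairs."""
    return sum(amount * prob for amount, prob in outcomes)


first_set = {
    "A": [(240, 1.0)],
    "B": [(1000, 0.25), (0, 0.75)],
    "C": [(-740, 1.0)],
    "D": [(-1000, 0.75), (0, 0.25)],
}
second_set = {
    "A": [(5000, 0.001), (0, 0.999)],
    "B": [(5, 1.0)],
    "C": [(-5000, 0.001), (0, 0.999)],
    "D": [(-5, 1.0)],
}

for name, alternatives in (("first", first_set), ("second", second_set)):
    for label, outcomes in alternatives.items():
        print(name, label, round(expected_value(outcomes), 2))
```

In the second set the risky and the sure alternatives have the same expected value ($5 for A and B, -$5 for C and D), so the strong preferences people report there cannot be explained by expected dollar amounts alone.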


How Bad Is a Bet on the British Open? 
Source: Larkey (1990), pp. 24-26. 


Betting on sports events is big business all over the world, yet if people made deci- 
sions solely on the basis of maximizing expected dollar amounts, they would not bet. 
Larkey (1990) decided to look at the odds given for the 1990 British Open golf tour- 
nament “by one of the largest betting shops in Britain” (p. 25) to see how much the 
betting shop stood to gain and the bettors stood to lose. 

Here is how betting on sports events works. The bookmaker sets odds on each of 
the possible outcomes, which in this case were individual players winning the tour- 
nament. For example, for the 1990 British Open, the bookmaker we will examine set 
odds of 50 to 1 for Jack Nicklaus. You pay one dollar, pound, or whatever to play. If 
your outcome happens, you win the amount given in the odds, plus get your money 
back. For example, if you placed a $1 bet on Jack Nicklaus winning and he won, you 
would receive $50 (minus a handling fee, which we will ignore for this discussion), 
in addition to getting your $1 back. Thus, the two possible outcomes are that you 
gain $50 or you lose $1. 
Table 18.2 Odds Given on the Top Ranked Players 
in the 1990 British Open 


Player                      Odds       Probability

 1. Nick Faldo               6 to 1    .1429
 2. Greg Norman              9 to 1    .1000
 3. Jose-Maria Olazabal     14 to 1    .0667
 4. Curtis Strange          14 to 1    .0667
 5. Ian Woosnam             14 to 1    .0667
 6. Seve Ballesteros        16 to 1    .0588
 7. Mark Calcavecchia       16 to 1    .0588
 8. Payne Stewart           16 to 1    .0588
 9. Bernhard Langer         22 to 1    .0435
10. Paul Azinger            28 to 1    .0345
11. Ronan Rafferty          33 to 1    .0294
12. Fred Couples            33 to 1    .0294
13. Mark McNulty            33 to 1    .0294


Source: Larkey (1990). 


Table 18.2 shows the 13 players the betting shop we are using rated most likely to 
win (those given the shortest odds), along with the odds the shop assigned to each of 
the players. 
The table also lists the probability of winning that would be required for each player, 
in order for someone who bet on that player to have a break-even expected value. 

Let’s look at how the probabilities in Table 18.2 are computed. Suppose you bet 
on Nick Faldo. The odds given for him were 6 to 1. Therefore, if you bet $1 on him 
and won, you would have a gain of $6. If you lost, you would have a net “gain” of 
-$1. Let’s call the probability that Faldo wins p and the probability that he doesn’t 
win 1 - p. What value of p would allow you to break even, that is, have an expected 
value of zero? 


$$\text{EV} = (\$6) \times p + (-\$1) \times (1 - p) = \$7 \times p - \$1$$


It should be obvious that if p = 1/7, the expected value would be zero. The value 
listed in Table 18.2 for Faldo is 1/7 = .1429. Probabilities for other players are de- 
rived using the same method, and the general formula should be obvious. If the odds 
are n to 1 for a particular player, someone who bets on that player will have an 
expected gain (or loss) of zero if the probability of the player winning is 1/(n + 1). In 
other words, for the odds presented, this would be a fair bet if the player’s actual 
probability of winning was the probability listed in the table. 
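
The break-even probabilities in Table 18.2 follow directly from the formula 1/(n + 1). The Python sketch below recomputes them from the listed odds; the function names are ours, and `expected_gain` assumes a $1 stake and ignores the handling fee, as in the discussion above.

```python
def break_even_probability(n):
    """Win probability at which a bet at odds of n to 1 has expected value zero."""
    return 1 / (n + 1)


def expected_gain(n, p, stake=1):
    """Expected gain on a bet of `stake` at odds of n to 1 with win probability p."""
    return n * stake * p - stake * (1 - p)


odds = [6, 9, 14, 14, 14, 16, 16, 16, 22, 28, 33, 33, 33]    # Table 18.2
probs = [round(break_even_probability(n), 4) for n in odds]  # rounded as in the table

print(probs)
print(round(sum(probs), 4))        # sum of the 13 listed probabilities
print(expected_gain(6, 1 / 7))     # essentially zero: a fair bet on Faldo
```

Because the entries are rounded to four decimal places before being added, the total agrees with the sum of the 13 probabilities listed in the table.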

If the bookmaker had set fair odds, so that both the house and those placing bets 
had expected values of zero, then the probabilities for all of the players should sum 
to 1.00. The probabilities listed in Table 18.2 sum to .7856. But those weren’t the 
only players for whom bets could be placed. Larkey (1990, p. 25) lists a total of 40 
players, which is still apparently only a subset of the 156 choices. The 40 players 
listed by Larkey have probabilities summing to 1.27. With the odds set by the book- 
maker, the house has a definite advantage, even after taking off the “handling fee.” 
It is impossible to compute the true expected value for the house because we would 
need to know both the true probabilities of winning for each player and the number of 
bets placed on each player. Also, notice that just because the house has a positive ex- 
pected value does not mean that it will come out ahead. The winner of the tournament 
was Nick Faldo. If everyone had bet $1 on Nick Faldo, the house would have to pay 
each bettor $6 in addition to the $1 bet, which would clearly be a losing proposition. 
Bookmakers rely on the fact that many thousands of bets are made, so the aggregate 
win (or loss) per bet for them should be very close to the expected value. 


For Those Who Like Formulas 


Conditional Probability 


The conditional probability of event A, given knowledge that event B happened, is 
denoted by P(A|B). 


Bayes’ Rule 


Suppose A₁ and A₂ are complementary events with known probabilities. In other 
words, they are mutually exclusive and their probabilities sum to 1. For example, 
they might represent presence and absence of a disease in a randomly chosen 
individual. Suppose B is another event such that the conditional probabilities P(B|A₁) 
and P(B|A₂) are both known. For example, B might be the event of testing 
positive for the disease. We do not need to know P(B). 

Then Bayes’ Rule determines the conditional probability in the other direction: 


$$P(A_1 \mid B) = \frac{P(A_1)\,P(B \mid A_1)}{P(A_1)\,P(B \mid A_1) + P(A_2)\,P(B \mid A_2)}$$


For example, Bayes’ Rule can be used to determine the probability of having a dis- 
ease given that the test is positive. The base rate, sensitivity, and specificity would 
all need to be known. Bayes’ Rule is easily extended to more than two mutually ex- 
clusive events, as long as the probability of each one is known and the probability of 
B conditional on each one is known. 
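
As an illustration, Bayes’ Rule for two complementary events can be written as a short Python function and applied to the mammogram scenario of Section 18.4. The function and variable names are ours. Note that the rule needs the probability of a positive test when the disease is absent, which is 1 minus the specificity.

```python
def bayes_rule(prior_a1, p_b_given_a1, p_b_given_a2):
    """P(A1 | B) for complementary events A1 and A2."""
    prior_a2 = 1 - prior_a1                    # probabilities of A1 and A2 sum to 1
    numerator = prior_a1 * p_b_given_a1
    return numerator / (numerator + prior_a2 * p_b_given_a2)


# Mammogram scenario: A1 = lump is malignant, B = test shows malignant
base_rate = 0.01
sensitivity = 0.80
specificity = 0.90
print(round(bayes_rule(base_rate, sensitivity, 1 - specificity), 3))
```

The result agrees with the value 800/10,700 = .075 obtained from Table 18.1.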


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. Although it’s not quite true, suppose the probability of having a male child 
(M) is equal to the probability of having a female child (F). A couple has four 
children. 

a. Are they more likely to have FFFF or to have MFFM? Explain your answer. 


b. Which sequence in part a of this exercise would a belief in the law of small 
numbers cause people to say had higher probability? Explain. 
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*2. 


*6. 


"o. 


10. 


11. 


c. Is a couple with four children more likely to have four girls or to have two 
children of each sex? Explain. (Assume the decision to have four children 
was independent of the sex of the children.) 


Give an example of a sequence of events to which the gambler’s fallacy would 
not apply because the events are not independent. 


. Explain why it is not at all unlikely that in a class of 50 students two of them 


will have the same last name. 


. Suppose two sisters are reunited after not seeing each other since they were 3 


years old. They are amazed to find out that they are both married to men named 
James and that they each have a daughter named Jennifer. Explain why this is 
not so amazing. 


. Why is it not surprising that the night before a major airplane crash several peo- 


ple will have dreams about an airplane disaster? If you were one of those peo- 
ple, would you think that something amazing had occurred? 

Find a dollar bill or other item with a serial number. Write down the number. 
I predict that there is something unusual about it or some pattern to it. Explain 
what is unusual about it and how I was able to make that prediction. 


. The U.C. Berkeley Wellness Encyclopedia (1991) contains the following state- 


ment in its discussion of HIV testing: “In a high-risk population, virtually all 

people who test positive will truly be infected, but among people at low risk the 

false positives will outnumber the true positives. Thus, for every infected person 
correctly identified in a low-risk population, an estimated ten noncarriers [of the 

HIV virus] will test positive” (p. 360). 

a. Suppose you have a friend who is part of this low-risk population but who 
has just tested positive. Using the numbers in the statement, calculate the 
probability that the person actually carries the virus. 

b. Your friend is understandably upset and doesn’t believe that the probability 
of being infected with HIV isn’t really near 1. After all, the test is accurate 
and it came out positive. Explain to your friend how the Wellness Encyclo- 
pedia statement can be true, even though the test is very accurate both for 
people with HIV and for people who don’t carry it. If it’s easier, you can 
make up numbers to put in a table to support your argument. 


. In financial situations, are businesses or individuals more likely to make use of 


expected value for making decisions? Explain. 


Many people claim that they can often predict who is on the other end of the 
phone when it rings. Do you think that phenomenon has a normal explanation? 
Explain. 

Suppose a rare disease occurs in about | out of 1000 people who are like you. 
A test for the disease has sensitivity of 95% and specificity of 90%. Using the 
technique described in this chapter, compute the probability that you actually 
have the disease, given that your test results are positive. 

11. You are at a casino with a friend, playing a game in which dice are involved. 
Your friend has just lost six times in a row. She is convinced that she will win 
on the next bet because she claims that, by the law of averages, it’s her turn to 
win. She explains to you that the probability of winning this game is 40%, and 
because she has lost six times, she has to win four times to make the odds work 
out. Is she right? Explain. 

*12. Using the data in Table 18.1 about a hypothetical population of 100,000 women 
tested for breast cancer, find the probability of each of the following events: 

*a. A woman whose test shows a malignant lump actually has a benign lump. 

*b. A woman who actually has a benign lump has a test that shows a malignant 
lump. 

*c. A woman with unknown status has a test showing a malignant lump. 

13. Using the data in Table 18.1, give numerical values and explain the meaning of 
the sensitivity and the specificity of the test. 

14. Explain why the story about George D. Bryson, reported in Example 1 in this 
chapter, is not all that surprising. 

15. A statistics professor once made a big blunder by announcing to his class of 
about 50 students that he was fairly certain that someone in the room would 
share his birthday. We have already learned that there is a 97% chance that there 
will be 2 people in a room of 50 with a common birthday. Given that information, 
why was the professor’s announcement a blunder? Do you think he was 
successful in finding a match? Explain. 

16. Suppose the sensitivity of a test is .90. Give either the false positive or the false 
negative rate for the test, and explain which you are providing. Could you provide 
the other one without additional information? Explain. 

*17. Suppose a friend reports that she has just had a string of “bad luck” with her car. 
She had three major problems in as many months and now has replaced many 
of the worn parts with new ones. She concludes that it is her turn to be lucky and 
that she shouldn’t have any more problems for a while. Is she using the 
gambler’s fallacy? Explain. 

18. If you wanted to pretend that you could do psychic readings, you could perform 
“cold readings” by inviting people you do not know to allow you to tell them 
about themselves. You would then make a series of statements like 

“I see that there is some distance between you and your mother that bothers 
you.” 

“It seems that you are sometimes less sure of yourself than you indicate.” 

“You are thinking of two men in your life [or two women, for a male 
client], one of whom is sort of light-complexioned and the other of whom 
is slightly darker. Do you know who I mean?” 

In the context of the material in this chapter, explain why this trick would often 
work to convince people that you are indeed psychic. 

19. Explain why it would be much more surprising if someone were to flip a coin 
and get six heads in a row after telling you they were going to do so than it 
would be to simply watch them flip the coin six times and observe six heads in 
a row. 

*20. We learned in this chapter that one idea researchers have tested was that when 
forced to make a decision, people choose the alternative that yields the highest 
expected value. 

*a. If that were the case, explain which of the following two choices people 
would make: 
Choice A: Accept a gift of $10. 
Choice B: Take a gamble with probability 1/1000 of winning $9000 
and 999/1000 of winning nothing. 

b. Explain how the situation in part a resembles the choices people have when 
they decide whether to buy lottery tickets. 


21. It is time for the end-of-summer sales. One store is offering bathing suits at 50% 
of their usual cost, and another store is offering to sell you two for the price of 
one. Assuming the suits originally all cost the same amount, which store is of- 
fering a better deal? Explain. 


*22. Refer to Case Study 18.2, in which the relationship between betting odds and 
probability of occurrence is explained. 

a. Suppose you are offered a bet on an outcome for which the odds are 2 to 1 
and there is no handling fee. For you to have a break-even expected value of 
zero, what would the probability of the outcome occurring have to be? 

*b. Suppose you believe that the probability that your team will win a game is 
1/4. What odds should you be offered in order to place a bet in which you 
think you have a break-even expected value? 

c. Explain what the two possible outcomes would be for the situation in part b, 
assuming you were offered the break-even odds and bet $1. Show that the 
expected value would indeed be zero. 


23. Suppose you are trying to decide whether to park illegally while you attend 
class. If you get a ticket, the fine is $25. If you assess the probability of getting 
a ticket to be 1/100, what is the expected value for the fine you will have to pay? 
Under those circumstances, explain whether you would be willing to take the 
risk and why. (Note that there is no correct answer to the last part of the ques- 
tion; it is designed to test your reasoning.) 


24. Comment on the following unusual lottery events, including a probability 
assessment. 

a. On September 11, 2002, the first anniversary of the 9/11 attack on the World 
Trade Center, the winning number for the New York State lottery was 911. 

b. To play the Maryland Pick 4 lottery, players choose four numbers from the 
digits 0 to 9. The game is played twice every day, at midday and in the 
evening. In 1999, holiday players who decided to repeat previous winning 
numbers got lucky. At midday on December 24, the winning numbers were 
7535, exactly the same as on the previous evening. And on New Year’s Eve, 
the evening draw produced the numbers 9521, exactly the same as the 
previous evening. 

Mini-Projects 


1. Find out the sensitivity and specificity of a common medical test. Calculate the 
probability of a true positive for someone who tests positive with the test, 
assuming the rate in the population is 1 per 100; then calculate the probability 
assuming the rate in the population is 1 per 1000. 


2. Ask four friends to tell you their most amazing coincidence story. Use the ma- 
terial in this chapter to assess how surprising each of the stories is to you. Pick 
one of the stories and try to approximate the probability of that specific event 
happening to your friend. 


3. Conduct a survey in which you ask 20 people the two scenarios presented in 
Thought Question 5 at the beginning of this chapter and discussed in Section 
18.5. Record the percentage who choose alternative A over B and the percent- 
age who choose alternative C over D. 


a. Report your results. Are they consistent with what other researchers have 
found? (Refer to p. 344.) Explain. 


b. Explain how you conducted your survey. Discuss whether you overcame the 
potential difficulties with surveys that were discussed in Chapter 4. 
Making Judgments 
from Surveys 
and Experiments 


In Part 1, you learned how data should be collected in order to be meaningful. 
In Part 2, you learned some simple things you could do with data, and in Part 
3, you learned that uncertainty can be quantified and can lead to worthwhile 
information about the aggregate. 

In Part 4, you will learn about the final steps that allow us to turn data into 
useful information. You will learn how to use samples collected in surveys and 
experiments to say something intelligent about what is probably happening in 
an entire population. 

Chapters 19 to 24 are somewhat more technical than previous chapters. Try 
not to get bogged down in the details. Remember that the purpose of this 
material is to enable you to say something about a whole population after 
examining just a small piece of it in the form of a sample. The book concludes with 
Chapter 27, which provides 10 case studies that will reinforce your awareness 
that you have indeed become an educated consumer of statistical information. 

The Diversity of Samples 
from the Same Population 


Thought Questions 


1. Suppose that 40% of a large population disagree with a proposed new law. In parts 
a and b, think about the role of the sample size when you answer the question. 

a. If you randomly sample 10 people, will exactly 4 (40%) disagree with the law? 
Would you be surprised if only 2 of the people in the sample disagreed with the 
law? How about if none of the sample disagreed with it? 

b. Now suppose you randomly sample 1000 people. Will exactly 400 (40%) disagree 
with the law? Would you be surprised if only 200 of the people in the sample 
disagreed with the law? How about if none of the sample disagreed with it? 

c. Explain how the long-run relative-frequency interpretation of probability and the 
gambler’s fallacy helped you answer parts a and b. 


2. Suppose the mean weight of all women at a large university is 135 pounds, with a 
standard deviation of 10 pounds. 

a. Recalling the Empirical Rule from Chapter 8, about bell-shaped curves, in what 
range would you expect 95% of the women’s weights to fall? 

b. If you were to randomly sample 10 women at the university, how close do you 
think their average weight would be to 135 pounds? If you sample 1000 women, 
would you expect the average weight to be closer to 135 pounds than it would 
be for the sample of only 10 women? 


3. Recall from Chapter 4 that a survey of 1000 randomly selected individuals has a 
margin of error of about 3%, so that the results are accurate to within plus or minus 3% 
most of the time. Suppose 25% of adults believe in reincarnation. If 10 polls are 
taken, each asking a different random sample of 1000 adults about belief in 
reincarnation, would you expect each poll to find exactly 25% of respondents expressing 
belief in reincarnation? If not, into what range would you expect the 10 sample 
proportions to reasonably fall? 

19.1 Setting the Stage 


This chapter serves as an introduction to the reasoning that allows pollsters and re- 
searchers to make conclusions about entire populations on the basis of a relatively 
small sample of individuals. The reward for understanding the material presented in 
this chapter will come in the remaining chapters of this book, as you begin to real- 
ize the power of the statistical tools in use today. 


Working Backward from Samples to Populations 


The first step in this process is to work backward: from a sample to a population. We 
start with a question about a population, such as: How many teenagers are infected 
with HIV? At what average age do left-handed people die? What is the average in- 
come of all students at a large university? We collect a sample from the population 
about which we have the question, and we measure the variable of interest. We can 
then answer the question of interest for the sample. Finally, based on what statisti- 
cians have worked out, we will be able to determine how close the answer from our 
sample is to what we really want to know: the actual answer for the population. 


Understanding Dissimilarity among Samples 


The secret to understanding how things work is to understand what kind of dis- 
similarity we should expect to see in various samples from the same population. 
For example, suppose we knew that most samples were likely to provide an answer 
that is within 10% of the population answer. Then we would also know the reverse 
—the population answer should be within 10% of whatever our specific sample 
gave. Armed only with our sample value, we could make a good guess about the 
population value. You have already seen this idea at work in Chapter 4, when we 
used the margin of error for a sample survey to estimate results for the entire pop- 
ulation. Statisticians have worked out similar techniques for a variety of sample 
measurements. In this and the next two chapters, we will cover some of these tech- 
niques in detail. 


19.2 What to Expect of Sample Proportions 


Suppose we want to know what proportion of a population carries the gene for a cer- 
tain disease. We sample 25 people, and from that sample we make an estimate of the 
true answer. 

Suppose that 40% of the population actually carries the gene. We can think of the 
population as consisting of two types of people: those who do not carry the gene, 
represented as O, and those who do carry the gene, represented as X. Figure 19.1 
is a conceptual illustration of part of such a population. 

Figure 19.1 
A slice of a population where 40% are X (rows of O and X symbols not reproduced) 


Possible Samples 


What would we find if we randomly sampled 25 people from this population? Would 
we always find 10 people (40%) with the gene and 15 people (60%) without? You 
should know from our discussion of the gambler’s fallacy in Chapter 18 that we 
would not. 

Each person we chose for our sample would have a 40% probability of carrying 
the gene. But remember that the relative-frequency interpretation of probability only 
ensures that we would see 40% of our sample with the gene in the very long run. A 
sample of only 25 people does not qualify as “the very long run.” 

What should we expect to see? Figure 19.2 shows four different random samples 
of 25 people taken from the population shown in Figure 19.1. Here is what we would 
have concluded about the proportion of people who carry the gene, given each of 
those samples: 


Sample 1: Proportion with gene = 12/25 = .48 = 48% 
Sample 2: Proportion with gene = 9/25 = .36 = 36% 
Sample 3: Proportion with gene = 10/25 = .40 = 40% 
Sample 4: Proportion with gene = 7/25 = .28 = 28% 


Notice that each sample gives a different answer, and the sample answer may or may 
not actually match the truth about the population. 
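
A short simulation makes the same point. The sketch below is not from the text: it assumes a population large enough that each person sampled has, independently, a 40% chance of carrying the gene, and the function name and seed are purely illustrative. Running it with different seeds gives different sets of four proportions, just as different random samples would.

```python
import random


def sample_proportion(true_proportion, sample_size, rng):
    """Draw one random sample and return the proportion with the trait."""
    carriers = sum(rng.random() < true_proportion for _ in range(sample_size))
    return carriers / sample_size


rng = random.Random(19)  # arbitrary seed so the run is repeatable
proportions = [sample_proportion(0.40, 25, rng) for _ in range(4)]

for number, proportion in enumerate(proportions, start=1):
    print(f"Sample {number}: proportion with gene = {proportion:.2f}")
```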
Figure 19.2 
Four possible random samples from Figure 19.1 (sample listings not reproduced) 


In practice, when a researcher conducts a study similar to this one or a pollster 
randomly samples a group of people to measure public opinion, only one sample is 
collected. There is no way to determine whether the sample is an accurate reflection 
of the population. However, statisticians have calculated what to expect for possible 
samples. We call the applicable rule the Rule for Sample Proportions. 


Conditions for Which the Rule 
for Sample Proportions Applies 


The following three conditions must all be met for the Rule for Sample Pro- 
portions to apply: 


1. There exists an actual population with a fixed proportion who have a certain 
trait, opinion, disease, and so on. 
or 
There exists a repeatable situation for which a certain outcome is likely to 
occur with a fixed relative-frequency probability. 

2. A random sample is selected from the population, thus ensuring that the 
probability of observing the characteristic is the same for each sample unit. 
or 
The situation is repeated numerous times, with the outcome each time 
independent of all other times. 

3. The size of the sample or the number of repetitions is relatively large. The 
necessary size depends on the proportion or probability under investigation. 
It must be large enough so that we are likely to see at least five with 
and five without the specified trait. 
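
Condition 3 can be checked with simple arithmetic. The sketch below is one possible reading, not a rule stated in the text: it treats "likely to see at least five" as meaning the expected counts with and without the trait are each at least five. The function name is hypothetical.

```python
def sample_is_large_enough(sample_size, proportion, minimum=5):
    """Check that the expected counts with and without the trait reach `minimum`."""
    expected_with = sample_size * proportion
    expected_without = sample_size * (1 - proportion)
    return expected_with >= minimum and expected_without >= minimum


# 25 people, 40% with the trait: about 10 with and 15 without.
print(sample_is_large_enough(25, 0.40))
# 25 people, 10% with the trait: only about 2.5 expected with the trait.
print(sample_is_large_enough(25, 0.10))
```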


Examples of Situations for Which the Rule 
for Sample Proportions Applies 

Here are some examples of situations that meet these conditions. 

EXAMPLE 1 Election Polls 

A pollster wants to estimate the proportion of voters who favor a certain candidate. 
The voters are the population units, and favoring the candidate is the opinion of 
interest. 

EXAMPLE 2 Television Ratings 

A television rating firm wants to estimate the proportion of households with television 
sets that are tuned to a certain television program. The collection of all households with 
television sets makes up the population, and being tuned to that particular program is 
the trait of interest. 

EXAMPLE 3 Consumer Preferences 

A manufacturer of soft drinks wants to know what proportion of consumers prefers a 
new mixture of ingredients compared with the old recipe. The population consists of 
all consumers, and the response of interest is preference for the new formula over the 
old one. 

EXAMPLE 4 Testing ESP 

A researcher studying extrasensory perception (ESP) wants to know the probability 
that people can successfully guess which of five symbols is on a hidden card. Each card 
is equally likely to contain each of the five symbols. There is no physical population. 
The repeatable situation of interest is a guess, and the response of interest is a 
successful guess. The researcher wants to see if the probability of a correct guess is higher 
than 20%, which is what it would be if there were no such thing as extrasensory 
perception. 


Defining the Rule for Sample Proportions 


The following is what statisticians have determined to be approximately true for the 
situations that have just been described in Examples 1–4 and for similar ones. 


If numerous samples or repetitions of the same size are taken, the frequency 
curve made from proportions from the various samples will be approximately 
bell-shaped. The mean of those sample proportions will be the true proportion 
from the population. The standard deviation will be 


the square root of: (true proportion) × (1 − true proportion)/(sample size) 

Figure 19.3 
Possible sample proportions when n = 2400 and truth = .4 

EXAMPLE 5 Using the Rule for Sample Proportions 

Suppose of all voters in the United States, 40% are in favor of Candidate X for 
president. Pollsters take a sample of 2400 people. What proportion of the sample would be 
expected to favor Candidate X? The rule tells us that the proportion of the sample who 
favor Candidate X could be anything from a bell-shaped curve with mean of .40 (40%) 
and standard deviation of 

the square root of: (.40) × (1 − .40)/2400 = (.4)(.6)/2400 = .24/2400 = .0001 

Thus, the mean is .40 and the standard deviation is the square root of .0001, which 
is .01 or 1/100 or 1%. 

Figure 19.3 shows what we can expect of the sample proportion in this situation. 
Recalling the rule we learned in Chapter 8 about bell-shaped distributions (the Empiri- 
cal Rule), we can also specify that for our sample of 2400 people, 


There is a 68% chance that the sample proportion is between 39% and 41%. 
There is a 95% chance that the sample proportion is between 38% and 42%. 
It is almost certain that the sample proportion is between 37% and 43%. 
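
The arithmetic in this example can be reproduced in a few lines. The sketch below simply applies the Rule for Sample Proportions and the Empirical Rule to the numbers above; the function and variable names are illustrative, not part of the text.

```python
import math


def sd_of_sample_proportions(true_proportion, sample_size):
    """Standard deviation given by the Rule for Sample Proportions."""
    return math.sqrt(true_proportion * (1 - true_proportion) / sample_size)


true_proportion, sample_size = 0.40, 2400
sd = sd_of_sample_proportions(true_proportion, sample_size)
print(f"standard deviation = {sd:.4f}")

# Intervals extending 1, 2, and 3 standard deviations from the mean.
intervals = [(true_proportion - k * sd, true_proportion + k * sd) for k in (1, 2, 3)]
for label, (low, high) in zip(("68%", "95%", "almost certain"), intervals):
    print(f"{label}: {low:.2f} to {high:.2f}")
```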


In practice, we have only one sample proportion and we don’t know the true pop- 
ulation proportion. However, we do know how far apart the sample proportion and 
the true proportion are likely to be. That information is contained in the standard de- 
viation, which can be estimated using the sample proportion combined with the 
known sample size. Therefore, when all we have is a sample proportion, we can in- 
deed say something about the true proportion. 
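
As a rough illustration of that last step, the sketch below plugs a sample proportion into the standard deviation formula in place of the unknown true proportion. The poll result of 42% is made up for the example, and the interval of 2 standard deviations on either side is only a preview under that assumption; Chapters 20 and 21 treat the idea properly.

```python
import math


def estimated_sd(sample_proportion, sample_size):
    """Estimate the standard deviation using the sample proportion itself."""
    return math.sqrt(sample_proportion * (1 - sample_proportion) / sample_size)


sample_proportion, sample_size = 0.42, 2400  # hypothetical poll result
sd = estimated_sd(sample_proportion, sample_size)

# The true proportion is likely to be within about 2 standard deviations.
low, high = sample_proportion - 2 * sd, sample_proportion + 2 * sd
print(f"estimated standard deviation = {sd:.4f}")
print(f"true proportion is probably between {low:.3f} and {high:.3f}")
```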


19.3 What to Expect of Sample Means 


In the previous section, the question of interest was the proportion falling into one 
category of a categorical variable. We saw that we could determine an interval of val- 
ues that was likely to cover the sample proportion if we knew the size of the sample 
and the magnitude of the true proportion. 


Figure 19.4 
Four potential samples from a population with mean = 8, standard deviation = 5 

We now turn to the case where the information of interest involves the mean or 
means of measurement variables. For example, researchers might want to compare 
the mean age at death for left- and right-handed people. A company that sells oat 
products might want to know the mean cholesterol level people would have if every- 
one had a certain amount of oat bran in their diet. To help determine financial aid 
levels, a large university might want to know the mean income of all students on 
campus who work. 


Possible Samples 


Suppose a population consists of thousands or millions of individuals, and we are 
interested in estimating the mean of a measurement variable. If we sample 25 peo- 
ple and compute the mean of the variable, how close will that sample mean be to 
the population mean we are trying to estimate? Each time we take a sample we will 
get a different sample mean. Can we say anything about what we expect those 
means to be? 

For example, suppose we are interested in estimating the average weight loss 
for everyone who attends a national weight-loss clinic for 10 weeks. Suppose, 
unknown to us, the weight losses for everyone have a mean of 8 pounds, with a 
standard deviation of 5 pounds. If the weight losses are approximately bell-shaped, 
we know from Chapter 8 that 95% of the individuals will fall within 2 standard 
deviations, or 10 pounds, of the mean of 8 pounds. In other words, 95% of the 
individual weight losses will fall between −2 (a gain of 2 pounds) and 18 pounds 
lost. 

Figure 19.4 lists some possible samples that could result from randomly sam- 
pling 25 people from this population; these were indeed the first four samples pro- 
duced by a computer that is capable of simulating such things. The weight losses 
have been put into increasing order for ease of reading. A negative value indicates a 
weight gain. 

Following are the sample means and standard deviations, computed for each of 
the four samples. You can see that the sample means, although all different, are rel- 
atively close to the population mean of 8. You can also see that the sample standard 
deviations are relatively close to the population standard deviation of 5. 


Sample 1: Mean = 8.32 pounds, standard deviation = 4.74 pounds 
Sample 2: Mean = 6.76 pounds, standard deviation = 4.73 pounds 
Sample 3: Mean = 8.48 pounds, standard deviation = 5.27 pounds 
Sample 4: Mean = 7.16 pounds, standard deviation = 5.93 pounds 


Sample 1: 1, 1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 7, 8, 8, 9, 9, 11, 11, 13, 13, 14, 14, 15, 16, 16 
Sample 2: −2, −2, 0, 0, 3, 4, 4, 4, 5, 5, 6, 6, 8, 8, 9, 9, 9, 9, 9, 10, 11, 12, 13, 13, 16 
Sample 3: −4, −4, 2, 3, 4, 5, 7, 8, 8, 9, 9, 9, 9, 9, 10, 10, 11, 11, 11, 12, 12, 13, 14, 16, 18 
Sample 4: −3, −3, −2, 0, 1, 2, 2, 4, 4, 5, 7, 7, 9, 9, 10, 10, 10, 11, 11, 12, 12, 14, 14, 14, 19 
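
The summary values listed for the four samples can be checked directly from the data. The sketch below uses Python's standard `statistics` module; `stdev` divides by one less than the sample size, which is the convention that matches the standard deviations reported above.

```python
from statistics import mean, stdev

# The four samples of 25 weight losses (negative values are gains).
samples = {
    1: [1, 1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 7, 8, 8, 9, 9, 11, 11, 13, 13, 14, 14, 15, 16, 16],
    2: [-2, -2, 0, 0, 3, 4, 4, 4, 5, 5, 6, 6, 8, 8, 9, 9, 9, 9, 9, 10, 11, 12, 13, 13, 16],
    3: [-4, -4, 2, 3, 4, 5, 7, 8, 8, 9, 9, 9, 9, 9, 10, 10, 11, 11, 11, 12, 12, 13, 14, 16, 18],
    4: [-3, -3, -2, 0, 1, 2, 2, 4, 4, 5, 7, 7, 9, 9, 10, 10, 10, 11, 11, 12, 12, 14, 14, 14, 19],
}

for number, losses in samples.items():
    print(f"Sample {number}: mean = {mean(losses):.2f} pounds, "
          f"standard deviation = {stdev(losses):.2f} pounds")
```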
Conditions to Which the Rule 
for Sample Means Applies 

As with sample proportions, statisticians have developed a rule to tell us what to 
expect of sample means. 

The Rule for Sample Means applies in both of the following situations: 

1. The population of the measurements of interest is bell-shaped, and a random 
sample of any size is measured. 

2. The population of measurements of interest is not bell-shaped, but a large 
random sample is measured. A sample of size 30 is usually considered 
“large,” but if there are extreme outliers, it is better to have a larger sample. 

There are only a limited number of situations for which the Rule for Sample 
Means does not apply. It does not apply at all if the sample is not random, and it does 
not apply for small random samples unless the original population is bell-shaped. In 
practice, it is often difficult to get a random sample. Researchers are usually willing 
to use the Rule for Sample Means as long as they can get a representative sample 
with no obvious sources of confounding or bias. 

Examples of Situations to Which the Rule 
for Sample Means Applies 

Following are some examples of situations that meet the conditions for applying the 
Rule for Sample Means. 

EXAMPLE 6 Average Weight Loss 

A weight-loss clinic is interested in measuring the average weight loss for participants in 
its program. The clinic makes the assumption that the weight losses will be bell-shaped, 
so the Rule for Sample Means will apply for any sample size. The population of interest 
is all current and potential clients, and the measurement of interest is weight loss. 

EXAMPLE 7 Average Age at Death 

A researcher is interested in estimating the average age at which left-handed adults die, 
assuming they have lived to be at least 50. Because ages at death are not bell-shaped, 
the researcher should measure at least 30 such ages at death. The population of 
interest is all left-handed people who live to be at least 50 years old. The measurement of 
interest is age at death. 

EXAMPLE 8 Average Student Income 

A large university wants to know the mean monthly income of students who work. The 
population consists of all students at the university who work. The measurement of 
interest is monthly income. Because incomes are not bell-shaped and there are likely to be 
outliers (a few people with high incomes), the university should use a large random 
sample of students. The researchers should take particular care to reach the people who are 
actually selected to be in the sample. A large bias could be created if, for example, they 
were willing to replace the desired respondent with a roommate who happened to be 
home when the researchers called. The students working the longest hours, and thus 
making the most money, would probably be hardest to reach by phone and the least 
likely to respond to a mail questionnaire. 

Defining the Rule for Sample Means 

The Rule for Sample Means is simple: 

If numerous samples of the same size are taken, the frequency curve of means 
from the various samples will be approximately bell-shaped. The mean of this 
collection of sample means will be the same as the mean of the population. 
The standard deviation will be 

population standard deviation/square root of sample size 

EXAMPLE 9 Using the Rule for Sample Means 

For our hypothetical weight-loss example, the population mean and standard deviation 
were 8 pounds and 5 pounds, respectively, and we were taking random samples of size 
25. The rule tells us that potential sample means are represented by a bell-shaped curve 
with a mean of 8 pounds and standard deviation of 5/5 = 1.0. (We divide the 
population standard deviation of 5 by the square root of 25, which also happens to be 5.) 

Therefore, we know the following facts about possible sample means in this 
situation, based on intervals extending 1, 2, and 3 standard deviations from the mean of 8: 

There is a 68% chance that the sample mean will be between 7 and 9. 
There is a 95% chance that the sample mean will be between 6 and 10. 
It is almost certain that the sample mean will be between 5 and 11. 

Figure 19.5 illustrates this situation. If you look at the four hypothetical samples we 
chose (see Figure 19.4), you will see that the sample means range from 6.76 to 8.48, 
well within the range we expect to see using these criteria. 
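
The same calculation can be written out as a short sketch. It applies the Rule for Sample Means to the weight-loss numbers (population mean 8, standard deviation 5, samples of 25); the function name is illustrative only.

```python
import math


def sd_of_sample_means(population_sd, sample_size):
    """Standard deviation given by the Rule for Sample Means."""
    return population_sd / math.sqrt(sample_size)


population_mean, population_sd, sample_size = 8, 5, 25
sd = sd_of_sample_means(population_sd, sample_size)

# Intervals extending 1, 2, and 3 standard deviations from the mean of 8.
intervals = [(population_mean - k * sd, population_mean + k * sd) for k in (1, 2, 3)]
for label, (low, high) in zip(("68%", "95%", "almost certain"), intervals):
    print(f"{label}: sample mean between {low:g} and {high:g}")
```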


Increasing the Size of the Sample 


Suppose we had taken a sample of 100 people instead of 25. Notice that the mean of 
the possible sample means would not change; it would still be 8 pounds, but the 
standard deviation would decrease. It would now be 5/10 = .5, instead of 1.0. 
Therefore, for samples of size 100, here is what we would expect of sample means 
for the weight-loss situation: 

There is a 68% chance that the sample mean will be between 7.5 and 8.5. 
There is a 95% chance that the sample mean will be between 7 and 9. 
It is almost certain that the sample mean will be between 6.5 and 9.5. 

Figure 19.5 
Possible sample means for samples of 25 from a bell-shaped population 
with mean = 8 and standard deviation = 5 


It’s obvious that the Rule for Sample Means tells us the same thing our common 
sense tells us: Larger samples tend to result in more accurate estimates of population 
values than do smaller samples. 
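
A simulation gives a feel for this. The sketch below is an illustration rather than anything from the text: it assumes the weight losses are exactly bell-shaped (normal) with mean 8 and standard deviation 5, draws 2000 samples of each size, and looks at how much the sample means vary. The seed and the number of repetitions are arbitrary choices. The spread of the simulated means should come out near 1.0 for samples of 25 and near 0.5 for samples of 100.

```python
import random
from statistics import mean, stdev


def simulate_sample_means(sample_size, repetitions, rng,
                          population_mean=8, population_sd=5):
    """Return the means of many random samples from a bell-shaped population."""
    return [
        mean([rng.gauss(population_mean, population_sd) for _ in range(sample_size)])
        for _ in range(repetitions)
    ]


rng = random.Random(2024)  # arbitrary seed so the run is repeatable
means_25 = simulate_sample_means(25, 2000, rng)
means_100 = simulate_sample_means(100, 2000, rng)

print(f"n = 25:  mean of sample means = {mean(means_25):.2f}, "
      f"standard deviation = {stdev(means_25):.2f}")
print(f"n = 100: mean of sample means = {mean(means_100):.2f}, "
      f"standard deviation = {stdev(means_100):.2f}")
```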

This discussion presumed that we know the population mean and the standard 
deviation. Obviously, that’s not much use to us in real situations when the popula- 
tion mean is what we are trying to determine. In Chapter 21, we will see how to use 
the Rule for Sample Means to accurately estimate the population mean when all we 
have available is a single sample for which we can compute the mean and the stan- 
dard deviation. 


19.4 What to Expect in Other Situations 


We have discussed two common situations that arise in assessing public opinion, 
conducting medical research, and so on. The first situation arises when we want to 
know what proportion of a population falls into one category of a categorical vari- 
able. The second situation occurs when we want to know the mean of a population 
for a measurement variable. 

There are numerous other situations for which researchers would like to use re- 
sults from a sample to say something about a population or to compare two or more 
populations. Statisticians have determined rules similar to those in this chapter for 
most of the situations researchers are likely to encounter. Those rules are too com- 
plicated for a book of this nature. However, once you understand the basic ideas for 
the two common scenarios covered here, you will be able to understand the results 
researchers present in more complicated situations. The basic ideas we explore ap- 
ply equally to most other situations. You may not understand exactly how re- 
searchers determined their results, but you will understand the terminology and 
some of the potential misinterpretations. 

In the next several chapters, we explore the two basic techniques researchers use 
to summarize their statistical results: confidence intervals and hypothesis testing. 


Confidence Intervals 

One basic technique researchers use is to create a confidence interval, which is an 
interval of values that the researcher is fairly sure covers the true value for the 
population. 

We encountered confidence intervals in Chapter 4, when we learned about the 
margin of error. Adding and subtracting the margin of error to the reported sample 
proportion creates an interval that we are 95% “confident” covers the truth. That 
interval is the confidence interval. We will explore confidence intervals further in 
Chapters 20 and 21. 

Hypothesis Testing 

The second statistical technique researchers use is called hypothesis testing or 
significance testing. Hypothesis testing uses sample data to attempt to reject the 
hypothesis that nothing interesting is happening, that is, to reject the notion that 
chance alone can explain the sample results. 

We encountered this idea in Chapter 13, when we learned how to determine 
whether the relationship between two categorical variables is “statistically 
significant.” The hypothesis that researchers set about to reject in that setting was that two 
categorical variables are unrelated to each other. In most research settings, the desired 
conclusion is that the variables under scrutiny are related. Achieving statistical 
significance is equivalent to rejecting the idea that chance alone can explain the observed 
results. We will explore hypothesis testing further in Chapters 22, 23, and 24. 

CASE STUDY 19.1 
Do Americans Really Vote When They Say They Do? 


On November 8, 1994, a historic election took place, in which the Republican Party 
won control of both houses of Congress for the first time since 1952. But how many 
people actually voted? On November 28, 1994, Time magazine (p. 20) reported that 
in a telephone poll of 800 adults taken during the two days following the election, 
56% reported that they had voted. Considering that only about 68% of adults are reg- 
istered to vote, that isn’t a bad turnout. 

But, along with these numbers, Time reported another disturbing fact. They re- 
ported that, in fact, only 39% of American adults had voted, based on information 
from the Committee for the Study of the American Electorate. 

Could it be the case that the results of the poll simply reflected a sample that, by 
chance, voted with greater frequency than the general population? The Rule for 
Sample Proportions can answer that question. Let’s suppose that the truth about the 
population is, as reported by Time, that only 39% of American adults voted. Then the 
Rule for Sample Proportions tells us what kind of sample proportions we can expect 
in samples of 800 adults, the size used by the Time magazine poll. The mean of the 
possibilities is .39, or 39%. The standard deviation is the square root of 
(.39)(.61)/800, which is .017, or 1.7%. 

Therefore, we are almost certain that the sample proportion based on a sample
of 800 adults should fall within 3 × 1.7% = 5.1% of the truth of 39%. In other
words, if respondents were telling the truth, the sample proportion should be no
higher than 44.1%, nowhere near the reported percentage of 56%. Figure 19.6
illustrates the situation.

Figure 19.6 Likely sample proportions who voted, if polls of 800 are taken from a
population in which .39 (39%) voted. [Axis marks at .339, .39, and .441.]

In fact, if we combine the Rule for Sample Proportions with what we learned 
about bell-shaped curves in Chapter 8, we can say even more about how unlikely this 
sample result would be. If, in truth, only 39% of the population voted, the
standardized score for the sample proportion of 56% is (.56 − .39)/.017 = 10. We know
from Chapter 8 that it is virtually impossible to obtain a standardized score of 10. 
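
The arithmetic in this case study can be checked with a short sketch. It assumes only the figures quoted above (39% true turnout, a poll of 800, 56% reporting that they voted); the function and variable names are illustrative and not part of the text.

```python
import math

def sample_proportion_sd(p, n):
    """Standard deviation of sample proportions under the Rule for Sample Proportions."""
    return math.sqrt(p * (1 - p) / n)

true_p = 0.39    # proportion who actually voted, as reported by Time
n = 800          # size of the telephone poll
reported = 0.56  # proportion of the sample who said they voted

sd = sample_proportion_sd(true_p, n)
low, high = true_p - 3 * sd, true_p + 3 * sd   # where almost all sample proportions fall
z = (reported - true_p) / sd                   # standardized score for the poll result

print(f"SD = {sd:.3f}")
print(f"almost all samples: {low:.3f} to {high:.3f}")
print(f"standardized score = {z:.1f}")
```

Without rounding the standard deviation to .017 first, the standardized score comes out slightly under 10, which does not change the conclusion.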

Another example of the fact that reported voting tends to exceed actual voting 
occurred in the 1992 U.S. presidential election. According to the World Almanac and 
Book of Facts (1995, p. 631), 61.3% of American adults reported voting in the 1992 
election. In a footnote, the Almanac explains: 


Total reporting voting compares with 55.9 percent of population actually vot- 
ing for president, as reported by Voter News Service. Differences between data 
may be the result of a variety of factors, including sample size, differences in 
the respondents’ interpretation of the questions, and the respondents’ inability 
or unwillingness to provide correct information or recall correct information. 


Unfortunately, because figures are not provided for the size of the sample, we can- 
not assess whether the difference between the actual percentage of 55.9 and the re- 
ported percentage of 61.3 can be explained by the natural variability among possible 
sample proportions. 


For Those Who Like Formulas 


Notation for Population and Sample Proportions 
Sample size = n 
Population proportion = p 


Sample proportion = $\hat{p}$, which is read “p-hat” because the p appears to have a
little hat on it.

The Rule for Sample Proportions


If numerous samples or repetitions of size n are taken, the frequency curve of
the $\hat{p}$’s from the various samples will be approximately bell-shaped. The mean
of those $\hat{p}$’s will be p. The standard deviation will be

$$\sqrt{\frac{p(1-p)}{n}}$$


Notation for Population and Sample Means and Standard Deviations 


Population mean = $\mu$ (read “mu”), population standard deviation = $\sigma$
(read “sigma”)


Sample mean = $\bar{X}$ (read “X-bar”), sample standard deviation = s


The Rule for Sample Means


If numerous samples of size n are taken, the frequency curve of the $\bar{X}$’s
from the various samples is approximately bell-shaped with mean $\mu$ and
standard deviation $\sigma/\sqrt{n}$.


Another way to write these rules is using the notation for normal distributions from
Chapter 8:

$$\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \quad\text{and}\quad \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. Suppose you want to estimate the proportion of students at your college who are 
left-handed. You decide to collect a random sample of 200 students and ask 
them which hand is dominant. Go through the conditions for which the rule for 
sample proportions applies (p. 358) and explain why the rule would apply to this 
situation. 


2. Refer to Exercise 1. Suppose the truth is that .12 or 12% of the students are left- 
handed, and you take a random sample of 200 students. Use the Rule for Sam- 
ple Proportions to draw a picture similar to Figure 19.3, showing the possible 
sample proportions for this situation. 

3. According to the Sacramento Bee (2 April 1998, p. F5), “A 1997-98 survey of 
1027 Americans conducted by the National Sleep Foundation found that 23% of 
adults say they have fallen asleep at the wheel in the last year.” 

a. Conditions 2 and 3 needed to apply the Rule for Sample Proportions are met 
because this result is based on a large random sample of adults. Explain how 
condition 1 is also met. 

b. The article also said that (based on the same survey) “37 percent of adults
report being so sleepy during the day that it interferes with their daytime
activities.” If, in truth, 40% of all adults have this problem, find the interval
in which about 95% of all sample proportions should fall, based on samples
of size 1027. Does the result of this survey fall into that interval?


c. Suppose a survey based on a random sample of 1027 college students was 
conducted and 25% reported being so sleepy during the day that it interferes 
with their daytime activities. Would it be reasonable to conclude that the 
population proportion of college students who have this problem differs from 
the proportion of all adults who have the problem? Explain. 

*4. A recent Gallup Poll found that of 800 randomly selected drivers surveyed, 70%
thought they were better-than-average drivers. In truth, in the population, only
50% of all drivers can be “better than average.”

a. Draw a picture of the possible sample proportions that would result from 
samples of 800 people from a population with a true proportion of .50. 

*b. Would we be unlikely to see a sample proportion of .70, based on a sample 
of 800 people, from a population with a proportion of .50? Explain, using 
your picture from part a. 


c. Explain the results of this survey using the material from Chapter 17. 


5. Suppose you are interested in estimating the average number of miles per
gallon of gasoline your car can get. You calculate the miles per gallon for each of
the next nine times you fill the tank. Suppose, in truth, the values for your car
are bell-shaped, with a mean of 25 miles per gallon and a standard deviation of
1. Draw a picture of the possible sample means you are likely to get based on
your sample of nine observations. Include the intervals into which 68%, 95%,
and almost all of the potential sample means will fall.


6. Refer to Exercise 5. Redraw the picture under the assumption that you will
collect 100 measurements instead of only 9. Discuss how the picture differs from
the one in Exercise 5.


7. Give an example of a situation of interest to you for which the Rule for Sample
Proportions would apply. Explain why the conditions allowing the rule to be ap- 
plied are satisfied for your example. 


8. Suppose the population of IQ scores in the town or city where you live is
bell-shaped, with a mean of 105 and a standard deviation of 15. Describe the
frequency curve for possible sample means that would result from random samples
of 100 IQ scores.


9. Suppose that 35% of the students at a university favor the semester system, 60%
favor the quarter system, and 5% have no preference. Is a random sample of 100
students large enough to provide convincing evidence that the quarter system is
favored? Explain.


10. According to USA Today (20 April 1998, Snapshot), a poll of 8709 adults taken
in 1976 found that 9% believed in reincarnation, whereas a poll of 1000 adults
taken in 1997 found that 25% held that belief.

a. Assuming a proper random sample was used, verify that the sample propor- 
tion for the poll taken in 1976 almost certainly represents the population pro- 
portion to within about 1%. 


b. Based on these results, would you conclude that the proportion of all adults
who believe in reincarnation was higher in 1997 than it was in 1976? 
Explain. 


*11. Suppose 20% of all television viewers in the country watch a particular 
program. 


*a. For a random sample of 2500 households measured by a rating agency,
describe the frequency curve for the possible sample proportions who watch
the program.

*b. The program will be canceled if the ratings show less than 17% watching in
a random sample of households. Given that 2500 households are used for the
ratings, is the program in danger of getting canceled? Explain.

c. Draw a picture of the possible sample proportions, similar to Figure 19.3.
Illustrate where the sample proportion of .17 falls on the picture. Use this to
confirm your answer in part b.


12. Use the Rule for Sample Means to explain why it is desirable to take as large a 
sample as possible when trying to estimate a population value. 


*13. According to the Sacramento Bee (2 April 1998, p. F5), Americans get an aver- 
age of 6 hours and 57 minutes of sleep per night. A survey of a class of 190 sta- 
tistics students at a large university found that they averaged 7.1 hours of sleep 
the previous night, with a standard deviation of 1.95 hours. 


a. Assume that the population average for adults is 6 hours and 57 minutes, or
6.95 hours of sleep per night, with a standard deviation of 2 hours. Draw a
picture similar to Figure 19.6, illustrating how the Rule for Sample Means
would apply to sample means for random samples of 190 adults.

*b. Would the mean of 7.1 hours of sleep obtained from the statistics students be
a reasonable value to expect for the sample mean of a random sample of 190
adults? Explain.

c. Can the sample taken in the statistics class be considered to be a
representative sample of all adults? Explain.


*14. Explain whether each of the following situations meets the conditions for which 
the Rule for Sample Proportions applies. If not, explain which condition is 
violated. 


*a. Unknown to the government, 10% of all cars in a certain city do not meet
appropriate emissions standards. The government wants to estimate that
percentage, so they take a random sample of 30 cars and compute the sample
proportion that do not meet the standards.

*b. The Census Bureau would like to estimate what proportion of households
have someone at home between 7 p.m. and 7:30 p.m. on weeknights, to
determine whether that would be an efficient time to collect census data. The
Bureau surveys a random sample of 2000 households and visits them during
that time to see whether someone is at home.

c. You are interested in knowing what proportion of days in typical years have
rain or snow in the area where you live. For the months of January and
February, you record whether there is rain or snow each day, and then you
calculate the proportion.

d. A large company wants to determine what proportion of its employees are 
interested in on-site day care. The company asks a random sample of 100 
employees and calculates the sample proportion who are interested. 


15. Explain whether you think the Rule for Sample Means applies to each of the
following situations. If it does apply, specify the population of interest and the
measurement of interest. If it does not apply, explain why not.


a. A researcher is interested in what the average cholesterol level would be if 
people restricted their fat intake to 30% of calories. He gets a group of pa- 
tients who have had heart attacks to volunteer to participate, puts them on a 
restricted diet for a few months, and then measures their cholesterol. 


b. A large corporation would like to know the average income of the spouses 
of its workers. Rather than go to the trouble to collect a random sample, they 
post someone at the exit of the building at 5 p.m. Everyone who leaves between
5 p.m. and 5:30 p.m. is asked to complete a short questionnaire on the
issue; there are 70 responses. 


c. A university wants to know the average income of its alumni. Staff members 
select a random sample of 200 alumni and mail them a questionnaire. They 
follow up with a phone call to those who do not respond within 30 days. 


d. An automobile manufacturer wants to know the average price for which used 
cars of a particular model and year are selling in a certain state. They are able 
to obtain a list of buyers from the state motor vehicle division, from which 
they select a random sample of 20 buyers. They make every effort to find out 
what those people paid for the cars and are successful in doing so. 


16. In Case Study 19.1, we learned that about 56% of American adults actually
voted in the presidential election of 1992, whereas about 61% of a random 
sample claimed that they had voted. The size of the sample was not specified, 
but suppose it were based on 1600 American adults, a common size for such 
studies. 


a. Into what interval of values should the sample proportion fall 68%, 95%, and 
almost all of the time? 


b. Is the observed value of 61% reasonable, based on your answer to part a? 


c. Now suppose the sample had been of only 400 people. Compute a standard- 
ized score to correspond to the reported percentage of 61%. Comment on 
whether you believe people in the sample could all have been telling the 
truth, based on your result. 


*17. Suppose the population of grade-point averages (GPAs) for students at the end
of their first year at a large university has a mean of 3.1 and a standard devia- 
tion of .5. Draw a picture of the frequency curve for the mean GPA of a random 
sample of 100 students, similar to Figure 19.6. 


18. The administration of a large university wants to use a random sample of
students to measure student opinion of a new food service on campus.
Administrators plan to use a continuous scale from 1 to 100, where 1 is complete
dissatisfaction and 100 is complete satisfaction. They know from past experi- 
ence with such questions that the standard deviation for the responses is going 
to be about 5, but they do not know what to expect for the mean. They want to 
be almost sure that the sample mean is within plus or minus 1 point of the true 
population mean value. How large will their random sample have to be? 


Mini-Projects 


1. The goal of this mini-project is to help you verify the Rule for Sample
Proportions firsthand. You will use the population represented in Figure 19.1 to do so.
It contains 400 individuals, of whom 160 (40%) carry the gene for a disease and
the remaining 240 (60%) do not carry the gene. You are going to draw 20 samples
of size 15 from this population. Here are the steps you should follow:


Step 1: Develop a method for drawing simple random samples from this popu- 
lation. One way to do this is to cut up the symbols and put them all into a paper 
bag, shake well, and draw from the bag. There are less tedious methods, but 
make sure you actually get random samples. Explain your method. 


Step 2: Draw a random sample of size 15 and record the number and percent- 
age who carry the gene. 


Step 3: Repeat step 2 a total of 20 times, thus accumulating 20 samples, each 
of size 15. Make sure to start over each time; for example, if you used the 
method of drawing symbols from a paper bag, then put the symbols back into 
the bag after each sample of size 15 is drawn so they are available for the next 
sample as well. 


Step 4: Create a stemplot or histogram of your 20 sample proportions. Compute 
the mean. 


Step 5: Explain what the Rule for Sample Proportions tells you to expect for 
this situation. 


Step 6: Compare your results with what the Rule for Sample Proportions tells 
you to expect. Be sure to mention mean, standard deviation, shape, and the in- 
tervals into which you expect 68%, 95%, and almost all of the sample propor- 
tions to fall. 


2. The purpose of this mini-project is to help you verify the Rule for Sample
Means. Suppose you are interested in measuring the average amount of blood
contained in the bodies of adult women, in ounces. Suppose, in truth, the
population consists of the following listed values. (Each value would be repeated
millions of times, but in the same proportions as they exist in this list.) The actual
mean and standard deviation for these numbers are 110 ounces and 5 ounces,
respectively. The values are bell-shaped.
Population Values for Ounces of Blood in Adult Women


97 100 101 102 103 103 104 104 104 105 106
106 106 107 107 108 108 109 109 109 110 110
110 110 110 111 112 112 112 112 113 113 113
113 113 113 114 114 114 114 114 114 115 115
116 116 116 117 118 118


Step 1: Develop a method for drawing simple random samples from this popu- 
lation. One way to do this is to write each number on a slip of paper, put all the 
slips into a paper bag, shake well, and draw from the bag. If a number occurs 
multiple times, make sure you include it that many times. Make sure you actu- 
ally get random samples. Explain your method. 


Step 2: Draw a random sample of size 9. Calculate and record the mean for 
your sample. 


Step 3: Repeat step 2 a total of 20 times, thus accumulating 20 samples, each 
of size 9. Make sure to start over each time; for example, if you drew numbers 
from a paper bag, put the numbers back after each sample of size 9 so they are 
available for the next sample as well. 


Step 4: Create a stemplot or histogram of your 20 sample means. Compute the 
mean of those sample means. 


Step 5: Explain what the Rule for Sample Means tells you to expect for this 
situation. 


Step 6: Compare your results with what the Rule for Sample Means tells you to 
expect. Be sure to mention mean, standard deviation, shape, and the intervals 
into which you expect 68%, 95%, and almost all of the sample means to fall. 
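
For readers who prefer a less tedious method than paper slips, both mini-projects can be sketched in a few lines of code. This is only one possible approach, with several assumptions: carriers of the gene are coded as 1 and noncarriers as 0, the blood values are the 50 listed above, and the function name and random seed are arbitrary choices rather than anything prescribed by the projects.

```python
import random
import statistics

# Mini-Project 1 population: 400 individuals, 160 carriers (1) and 240 noncarriers (0).
gene_population = [1] * 160 + [0] * 240

# Mini-Project 2 population: the 50 listed values for ounces of blood.
blood_population = [
    97, 100, 101, 102, 103, 103, 104, 104, 104, 105, 106,
    106, 106, 107, 107, 108, 108, 109, 109, 109, 110, 110,
    110, 110, 110, 111, 112, 112, 112, 112, 113, 113, 113,
    113, 113, 113, 114, 114, 114, 114, 114, 114, 115, 115,
    116, 116, 116, 117, 118, 118,
]

def simulate_sample_means(population, sample_size, repetitions, rng):
    """Draw `repetitions` simple random samples and return the mean of each.

    rng.sample draws without replacement within one sample; every new sample
    starts again from the full population, like putting the slips back in the bag.
    """
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repetitions)
    ]

rng = random.Random(19)  # arbitrary seed so the run is repeatable

# The mean of 0/1 values is the sample proportion of carriers.
proportions = simulate_sample_means(gene_population, 15, 20, rng)
means = simulate_sample_means(blood_population, 9, 20, rng)

print("sample proportions:", [round(p, 2) for p in proportions])
print("mean of proportions:", round(statistics.mean(proportions), 3))
print("sample means:", [round(m, 1) for m in means])
print("mean of sample means:", round(statistics.mean(means), 2))
```

The printed lists correspond to Steps 2 through 4; the comparisons in Steps 5 and 6 are still left to you.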


Reference 


World almanac and book of facts. (1995). Edited by Robert Famighetti. Mahwah, NJ: Funk 
and Wagnalls. 


Estimating Proportions 
with Confidence 


Thought Questions 


1. One example we see in this chapter is a 95% confidence interval for the proportion
of British couples in which the wife is taller than the husband. The interval extends
from .02 to .08, or 2% to 8%. What do you think it means to say that the interval
from .02 to .08 represents a 95% confidence interval for the proportion of couples
in which the wife is taller than the husband?


2. Do you think a 99% confidence interval for the proportion described in Question 1
would be wider or narrower than the 95% interval given? Explain.


3. In a Yankelovich Partners poll of 1000 adults (USA Today, 20 April 1998), 45%
reported that they believed in “faith healing.” Based on this survey, a “95% confidence
interval” for the proportion in the population who believe is about 42% to 48%. If
this poll had been based on 5000 adults instead, do you think the “95% confidence
interval” would be wider or narrower than the interval given? Explain.


4. How do you think the concept of margin of error, explained in Chapter 4, relates to
confidence intervals for proportions? As a concrete example, can you determine the
margin of error for the situation in Question 1 from the information given? In
Question 3?
20.1 Confidence Intervals


In the previous chapter, we saw that we get different summary values (such as means 
and proportions) each time we take a sample from a population. We also learned that 
statisticians have been able to quantify the amount by which those sample values are 
likely to differ from each other and from the population. 

In practice, statistical methods are used in situations where only one sample is 
taken, and that sample is used to make a conclusion or an inference about numbers 
(such as means and proportions) for the population from which it was taken. One of 
the most common types of inferences is to construct what is called a confidence in- 
terval, which is defined as 


an interval of values computed from sample data that is almost sure to cover 
the true population number. 


The most common level of confidence used is 95%. In other words, researchers de- 
fine “almost sure” to mean that they are 95% certain. They are willing to take a 5% 
risk that the interval does not actually cover the true value. 

It would be impossible to construct an interval in which we could be 100% con- 
fident unless we actually measured the entire population. Sometimes, as we shall see 
in one of the examples in the next section, researchers employ only 90% confidence. 
In other words, they are willing to take a 10% chance that their interval will not 
cover the truth. 

Methods for actually constructing confidence intervals differ, depending on the 
type of question asked and the type of sample used. In this chapter, we learn to con- 
struct confidence intervals for proportions, and in the next chapter, we learn to con- 
struct confidence intervals for means. If you understand the kinds of confidence 
intervals we study in this chapter and the next, you will understand any other type of 
confidence interval as well. 

In most applications, we never know whether the confidence interval covers the 
truth; we can only apply the long-run frequency interpretation of probability. All we 
can know is that, in the long run, 95% of all confidence intervals tagged with 95% 
confidence will be correct (cover the truth) and 5% of them will be wrong. There is 
no way to know for sure which kind we have in any given situation. A common and 
humorous phrase among statisticians is: “Being a statistician means never having to 
say you’re certain.” 
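
The long-run interpretation can be illustrated with a small simulation. The sketch below is a set of assumptions rather than anything from the text: it invents a population in which 40% hold some opinion, takes many samples of 1000, and builds each interval as the sample proportion plus or minus 2 standard deviations, the recipe developed later in this chapter. All names and the seed are hypothetical.

```python
import math
import random

def confidence_interval(successes, n, multiplier=2.0):
    """Interval: sample proportion plus or minus multiplier x estimated SD."""
    p_hat = successes / n
    sd = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - multiplier * sd, p_hat + multiplier * sd

def coverage(true_p, n, repetitions, rng):
    """Fraction of simulated intervals that cover the true proportion."""
    hits = 0
    for _ in range(repetitions):
        # One sample: each of the n individuals is a "success" with probability true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = confidence_interval(successes, n)
        if low <= true_p <= high:
            hits += 1
    return hits / repetitions

rng = random.Random(20)  # arbitrary seed so the run is repeatable
rate = coverage(true_p=0.40, n=1000, repetitions=2000, rng=rng)
print(f"fraction of intervals covering the truth: {rate:.3f}")
```

In a run like this, the fraction should land close to 95%, even though no single interval reveals whether it is one of the correct ones.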


20.2 Three Examples of Confidence 
Intervals from the Media 


When the media report the results of a statistical study, they often supply the infor- 
mation necessary to construct a confidence interval. Sometimes they even provide a 
confidence interval directly. The most commonly reported information that can be 
used to construct a confidence interval is the margin of error. Most public opinion
polls report a margin of error along with the proportion of the sample that had each
opinion. To use that information, you need to know this fact:


To construct a 95% confidence interval for a population proportion, simply
add and subtract the margin of error to the sample proportion.


The margin of error is often reported using the symbol “±,” which is read “plus or
minus.” The formula for a 95% confidence interval can thus be expressed as


sample proportion ± margin of error


Let’s examine three examples from the media, in which confidence intervals are
either reported directly or can easily be derived.


EXAMPLE 1
A Public Opinion Poll


In a poll reported in The Sacramento Bee (19 November 2003, p. A20), 54% of
respondents agreed that gay and lesbian couples could be good parents. The report
also gave this information about the poll:


Source: Pew Research Center for the People and the Press survey of 1,515 U.S.
adults, Oct. 15-19; margin of error 3 percentage points.


What proportion of the entire adult population at that time would agree that gay and
lesbian couples could be good parents? A 95% confidence interval for that proportion
can be found by taking


sample proportion ± margin of error
54% ± 3%
51% to 57%


Notice that this interval does not cover 50%; it resides completely above 50%.
Therefore, it would be fair to conclude, with high confidence, that a majority of
Americans in 2003 believed that gay and lesbian couples can be good parents.


EXAMPLE 2
Number of AIDS Cases in the United States


An Associated Press story reported in the Davis (CA) Enterprise (14 December 1993, 
p. A-5) was headlined, “Rate of AIDS infection in U.S. may be declining.” The story 
noted that previous estimates of the number of cases of HIV infection in the United 
States were based on mathematical projections working backward from the known 
cases, rather than on a survey of the general population. The article reported that 


For the first time, survey data is now available on a randomly chosen cross-section 
of Americans. Conducted by the National Center for Health Statistics, it concludes 
that 550,000 Americans are actually infected. 


One of the main purposes of the research reported in the story was to estimate how
many Americans were currently infected with HIV, the virus that causes AIDS.
Some estimates had been as high as 10 million people, but the article noted that the
Centers for Disease Control (CDC) had estimated the number at about 1 million. Could
the results of this survey rule out the possibility that the number of people infected
may be as high as 10 million? The way to answer the question is with a confidence
interval for the true number who are infected, and the article proceeded to report
exactly that.
Dr. Geraldine McQuillan, who presented the analysis, said the statistical margin of
error in the survey suggests that the true figure is probably between 300,000 and 
just over 1 million people. “The real number may be a little under or a bit over the 
CDC estimate” [of 1 million], she said, “but it is not 10 million.” 


Notice that the article reports a confidence interval, but does not call it that. The
article also does not report the level of confidence, but it is probably 95%, the default
value used by most statisticians. You may also have noted that the interval is not the
simple form of “sample value ± margin of error.” Although the methods used to form the
confidence interval in this example were more complicated, the interpretation is just as
simple. With high confidence, we can conclude that the true number of HIV-infected
individuals at the time of the study was between 300,000 and 1 million.


Text not available due to copyright restrictions 
20.3 Constructing a Confidence Interval for a Proportion


You can easily learn to construct your own confidence intervals for some simple
situations. One of those situations is the one we encountered in the previous chapter,
in which a simple random sample is taken for a categorical variable. It is easy to
construct a confidence interval for the proportion of the population who fall into one of
the categories. Following are some examples of situations where this would apply.
After presenting the examples, we develop the method, and then return to the
examples to compute confidence intervals.


EXAMPLE 4
How Often Is the Wife Taller than the Husband?


In Chapter 10, we displayed data representing the heights of husbands and wives for a
random sample of 200 British couples. From that set of data, we can count the number of
couples for whom the wife is taller than the husband. We can then construct a confidence
interval for the true proportion of British couples for whom that would be the case.


EXAMPLE 5
An Experiment in Extrasensory Perception


In Chapter 22, we will describe in detail an experiment that was conducted to test for
extrasensory perception (ESP). For one part of the experiment, subjects were asked to
describe a video being watched by a “sender” in another room. The subjects were then
shown four videos and asked to pick the one they thought the “sender” had been
watching. Without ESP, the probability of a correct guess should be .25, or one-fourth,
because there were four equally likely choices. We will use the data from the experiment
to construct a confidence interval for the true probability of a correct guess and see if
the interval includes the .25 value that would be expected if there were no ESP.


EXAMPLE 6
The Proportion Who Would Quit Smoking with a Nicotine Patch


In Case Study 5.1, we examined an experiment in which 120 volunteers were given
nicotine patches. After 8 weeks, 55 of them had quit smoking. Although the volunteers were
not a random sample from a population, we can estimate the proportion of people who
would quit if they were recruited and treated exactly as these individuals were treated.


Developing the Formula for a 95% Confidence Interval 


We develop the formula for a 95% confidence interval only and discuss what we 
would do differently if we wanted higher or lower confidence. 


The formula will follow directly from the Rule for Sample Proportions: 

If numerous samples or repetitions of the same size are taken, the frequency 
curve made from proportions from the various samples will be approximately 
bell-shaped. The mean will be the true proportion from the population. The 
standard deviation will be 


the square root of: (true proportion) × (1 − true proportion)/(sample size)



Because the possible sample proportions are bell-shaped, we can make the fol- 
lowing statement: 


In 95% of all samples, the sample proportion will fall within 2 standard devia- 
tions of the mean, which is the true proportion for the population. 


This statement allows us to easily construct a 95% confidence interval for the true 
population proportion. Notice that we can rewrite it slightly, as follows: 


In 95% of all samples, the true proportion will fall within 2 standard deviations 
of the sample proportion. 


In other words, if we simply add and subtract 2 standard deviations to the sample 
proportion, in 95% of all cases we will have captured the true population proportion. 
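
The rewriting step above is a matter of rearranging one inequality. As a sketch, write $\hat{p}$ for the sample proportion, $p$ for the true proportion, and SD for the standard deviation given by the Rule for Sample Proportions (the symbols are borrowed from the formulas section of the previous chapter):

```latex
\[
|\hat{p} - p| \le 2\,\mathrm{SD}
\;\Longleftrightarrow\;
-2\,\mathrm{SD} \le p - \hat{p} \le 2\,\mathrm{SD}
\;\Longleftrightarrow\;
\hat{p} - 2\,\mathrm{SD} \le p \le \hat{p} + 2\,\mathrm{SD}
\]
```

The left-hand statement says the sample proportion is within 2 standard deviations of the truth; the right-hand statement says the truth lies in the interval built around the sample proportion. They hold for exactly the same samples, so both hold in about 95% of samples.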

There is just one hurdle left. If you examine the Rule for Sample Proportions to 
find the standard deviation, you will notice that it uses the “true proportion.” But we 
don’t know the true proportion; in fact, that’s what we are trying to estimate. There is 
a simple solution to this dilemma. We can get a fairly accurate answer if we substitute 
the sample proportion for the true proportion in the formula for the standard deviation. 


Putting all of this together, here is the formula for a 95% confidence interval 
for a population proportion: 


sample proportion ± 2(SD)
where


SD = the square root of:
(sample proportion) × (1 − sample proportion)/(sample size)


A technical note: To be exact, we would actually add and subtract 1.96(SD) instead 
of 2(SD) because 95% of the values for a bell-shaped curve fall within 1.96 standard 
deviations of the mean. However, in most practical applications, rounding 1.96 off 
to 2.0 will not make much difference and this is common practice. 
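
The formula translates directly into a few lines of code. The sketch below is illustrative only: the function name and the `multiplier` parameter are hypothetical, and the data are the nicotine patch counts described earlier in this section (55 of 120 volunteers quit). It also compares the multiplier 2 with the more exact 1.96.

```python
import math

def proportion_confidence_interval(successes, n, multiplier=2.0):
    """Return (low, high) for sample proportion plus or minus multiplier x SD."""
    p_hat = successes / n
    sd = math.sqrt(p_hat * (1 - p_hat) / n)  # sample proportion stands in for the truth
    return p_hat - multiplier * sd, p_hat + multiplier * sd

# Nicotine patch experiment: 55 of 120 volunteers quit after 8 weeks.
low2, high2 = proportion_confidence_interval(55, 120)
low196, high196 = proportion_confidence_interval(55, 120, multiplier=1.96)

print(f"using 2:    {low2:.3f} to {high2:.3f}")
print(f"using 1.96: {low196:.3f} to {high196:.3f}")
```

For these data the two versions differ only in the third decimal place, and both round to an interval of about .37 to .55.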


Continuing the Examples 


Let us now apply this formula to the examples presented at the beginning of this 
section. 


How Often Is the Wife Taller than the Husband? 
The data presented in Chapter 10, on the heights of 200 British couples, showed that 
in only 10 couples was the wife taller than the husband. Therefore, we find the follow- 
ing numbers: 

sample proportion = 10/200 = .05, or 5% 

standard deviation = square root of (.05)(.95)/200 = .015 

confidence interval = .05 + 2(.015) = .05 + .03 = .02 to .08 


In other words, we are 95% confident that of all British couples, between .02 (2%) and 
.08 (8%) are such that the wife is taller than her husband. 


EXAMPLE 5 CONTINUED 

An Experiment in Extrasensory Perception 

The data we will examine in detail in Chapter 22 include 165 cases of experiments in 
which a subject tried to guess which of four videos the “sender” was watching in another 
room. Of the 165 cases, 61 resulted in successful guesses. Therefore, we find the 
following numbers: 


sample proportion = 61/165 = .37, or 37% 
standard deviation = square root of (.37)(.63)/165 = .038 
confidence interval = .37 ± 2(.038) = .37 ± .08 = .29 to .45 


In other words, we are 95% confident that the probability of a successful guess in this 
situation is between .29 (29%) and .45 (45%). Notice that this interval lies entirely 
above the 25% value expected by chance. 


EXAMPLE 6 CONTINUED 

The Proportion Who Would Quit Smoking with a Nicotine Patch 

In Case Study 5.1, we learned that of 120 volunteers randomly assigned to use a nico- 
tine patch, 55 of them had quit smoking after 8 weeks. We use this information to es- 
timate the probability that a smoker recruited and treated in an identical fashion would 
quit smoking after 8 weeks: 


sample proportion = 55/120 = .46, or 46% 
standard deviation = square root of (.46)(.54)/120 = .045 
confidence interval = .46 ± 2(.045) = .46 ± .09 = .37 to .55 


In other words, we are 95% confident that between 37% and 55% of smokers treated 
in this way would quit smoking after 8 weeks. Remember that a placebo group was in- 
cluded for this experiment, in which 24 people, or 20%, quit smoking after 8 weeks. A 
confidence interval surrounding that value runs from 13% to 27% and thus does not 
overlap with the confidence interval for those using the nicotine patch. 
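The arithmetic in these examples can be checked with a few lines of Python. This is a sketch rather than part of the original analyses: the function name `proportion_ci` and the dictionary labels are invented for the illustration, while the counts are the ones quoted in the examples above (including the 24 of 120 placebo users who quit).

```python
import math


def proportion_ci(successes, n, multiplier=2.0):
    """Return (low, high) for: sample proportion +/- multiplier * SD."""
    p_hat = successes / n
    sd = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - multiplier * sd, p_hat + multiplier * sd


# (number of successes, sample size) as reported in the examples.
examples = {
    "wife taller than husband": (10, 200),
    "successful ESP guesses": (61, 165),
    "quit with nicotine patch": (55, 120),
    "quit with placebo patch": (24, 120),
}

intervals = {name: proportion_ci(x, n) for name, (x, n) in examples.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.3f} to {high:.3f}")
```

Because the program carries full precision instead of rounding at each step, an endpoint may differ from the hand calculation in the last digit. The conclusions are unchanged: the ESP interval stays above .25, and the nicotine patch and placebo intervals do not overlap.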


Other Levels of Confidence 


If you wanted to present a narrower interval, you would have to settle for less confi- 
dence. Applying the reasoning we used to construct the formula for a 95% confidence 
interval and using the information about bell-shaped curves from Chapter 8, we could 
have constructed a 68% confidence interval, for example. We would simply add and sub- 
tract 1 standard deviation to the sample proportion instead of 2. Similarly, if we added 
and subtracted 3 standard deviations, we would have a 99.7% confidence interval. 

Although 95% confidence intervals are by far the most common, you will some- 
times see 90% or 99% intervals as well. To construct those, you simply replace the 
value 2 in the formula with 1.645 for a 90% confidence interval or with the value 
2.576 for a 99% confidence interval. 
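The multipliers quoted here can be recovered from the standard bell-shaped (normal) curve. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `multiplier_for` is an assumption made for this illustration.

```python
from statistics import NormalDist


def multiplier_for(confidence):
    """Number of standard deviations on each side of the center that
    encloses the given central area under a bell-shaped curve."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)


for level in (0.68, 0.90, 0.95, 0.99, 0.997):
    print(f"{level:.1%} confidence: multiplier = {multiplier_for(level):.3f}")
```

The values for 90%, 95%, and 99% agree with 1.645, 1.96, and 2.576. The values for 68% and 99.7% come out slightly below 1 and 3, which reflects the fact that the Empirical Rule percentages are themselves rounded.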


How the Margin of Error Was Derived in Chapter 4 


We have already noted that you can construct a 95% confidence interval for a pro- 
portion if you know the margin of error. You simply add and subtract the margin of 
error to the sample proportion. 


Polls, such as the Pew Research Center poll in Example 1, generally use a mul- 
tistage sample (see Chapter 4). Therefore, the simple formulas for confidence inter- 
vals given in this chapter, which are based on simple random samples, do not give 
exactly the same answers as those using the margin of error stated. For polls based 
on multistage samples, it is more appropriate to use the stated margin of error than 
to use the formula given in this chapter. 

In Chapter 4, we presented an approximate method for computing the margin of 
error. Using the letter n to represent the sample size, we then said that a conservative 
way to compute the margin of error was to use 1/√n. Thus, we now have two 
apparently different formulas for finding a 95% confidence interval: 


sample proportion ± margin of error = sample proportion ± 1/√n 
or 
sample proportion ± 2(SD) 


How do we reconcile the two formulas? For them to agree, it must be true that 


margin of error = 1/√n = 2(SD) 


It turns out that these two formulas are equivalent when the proportion used in the 
formula for SD is .5. In other words, when 


standard deviation = SD = square root of (.5)(.5)/n = (.5)/√n 


In that case, 2(SD) is simply 1/√n, which is our conservative formula for margin 
of error. This is called a conservative formula because the true margin of error is ac- 
tually likely to be smaller. If you use any value other than .5 as the proportion in the 
formula for standard deviation, you will get a smaller answer than you get using .5. 
You will be asked to confirm this fact in Exercise 10 at the end of this chapter. 
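A short computation makes the comparison concrete. In this sketch the sample size of 1000 and the function name `two_sd` are assumptions chosen for illustration; the code compares 2(SD) at several proportions with the conservative margin of error 1/√n.

```python
import math


def two_sd(p, n):
    """Margin of error 2(SD) when proportion p is used in the SD formula."""
    return 2 * math.sqrt(p * (1 - p) / n)


n = 1000  # hypothetical sample size
conservative = 1 / math.sqrt(n)

print(f"Conservative margin of error 1/sqrt(n): {conservative:.4f}")
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"proportion {p:.1f}: 2(SD) = {two_sd(p, n):.4f}")
```

Only the proportion .5 reproduces the conservative value; every other proportion tried gives a smaller margin.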


CASE STUDY 20.1 

A Winning Confidence Interval Loses in Court 


Gastwirth (1988, p. 495) describes a court case in which Sears, Roebuck and Com- 
pany, a large department store chain, tried to use a confidence interval to determine 
the amount by which it had overpaid city taxes at stores in Inglewood, California. 
Unfortunately, the judge did not think the confidence interval was appropriate and 
required Sears to examine all the sales records for the period in question. This case 
study provides an example of a situation where the answer became known, so we can 
compare the results from the sample with the true answer. 

The problem arose because Sears had erroneously collected and paid city sales 
taxes for sales made to individuals outside the city limits. The company discovered 
the mistake during a routine audit, and asked the city for a refund of $27,000, the 
amount by which it estimated it had overpaid. 

Realizing that it needed data to substantiate this amount, Sears decided to take a 
random sample of sales slips for the period in question and then, on the basis of the 
sample proportion, try to estimate the proportion of all sales that had been made to 
people outside of city limits. It used a multistage sampling plan, in which the 
33-month period was divided into eleven 3-month periods to ensure that seasonal effects 
were considered. It then took a random sample of 3 days in each period, for a 
total of 33 days, and examined all sales slips for those days. 

Based on the data, Sears derived a 95% confidence interval for the true propor- 
tion of all sales that were made to out-of-city customers. The confidence interval was 
.367 ± .03, or .337 to .397. To determine the amount of tax Sears believed it was 
owed, the percentage of out-of-city sales was multiplied by the total tax paid, which 
was $76,975. The result was $28,250, with a 95% confidence interval extending 
from $25,940 to $30,559. 

The judge did not accept the use of sampling despite testimony from accounting 
experts who noted that it was common practice in auditing. The judge required Sears 
to examine all of the sales records. In doing so, Sears discovered that about one 
month’s worth of slips were missing; however, based on the available slips, it had 
overpaid $26,750.22. This figure is slightly under the true amount due to the miss- 
ing month, but you can see that the sampling method Sears had used provided a 
fairly accurate estimate of the amount it was owed. If we assume that the dollar 
amount from the missing month was similar to those for the months counted, we find 
that the total Sears was owed would have been about $27,586. 
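The dollar figures in this case study follow from simple multiplication, as the sketch below shows. The variable names are invented for the illustration, and the projection assumes that exactly 1 of the 33 months was missing, which is our reading of "about one month's worth"; the other numbers are those quoted above.

```python
# Figures quoted in the case study.
total_tax_paid = 76_975.00
p_hat, margin = 0.367, 0.03

# Convert the interval for the proportion into an interval in dollars.
estimate = p_hat * total_tax_paid
low = (p_hat - margin) * total_tax_paid
high = (p_hat + margin) * total_tax_paid
print(f"Estimated overpayment: ${estimate:,.0f} (${low:,.0f} to ${high:,.0f})")

# Full audit result, scaled up for the missing month.
# Assumption: 32 of the 33 months were available.
audited_amount = 26_750.22
months_total, months_found = 33, 32
projected = audited_amount * months_total / months_found
print(f"Projected full-period overpayment: ${projected:,.0f}")
```

The projected amount falls inside the dollar interval derived from the sample, which is the sense in which the sample gave a fairly accurate estimate.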

Sampling methods and confidence intervals are routinely used for financial au- 
dits. These techniques have two main advantages over studying all of the records. 
First, they are much cheaper. It took Sears about 300 person-hours to conduct the 
sample and 3384 hours to do the full audit. Second, a sample can be done more care- 
fully than a complete audit. In the case of Sears, it could have two well-trained peo- 
ple conduct the sample in less than a month. The full audit would require either 
having those same two people work for 10 months or training 10 times as many peo- 
ple. As Gastwirth (1988, p. 496) concludes in his discussion of the Sears case, “A 
well designed sampling audit may yield a more accurate estimate than a less care- 
fully carried out complete audit or census.” In fairness, the judge in this case was sim- 
ply following the law; the sales tax return required a sale-by-sale computation. 


For Those Who Like Formulas 


Notation for Population and Sample Proportions (from Chapter 19) 

Sample size = $n$ 
Population proportion = $p$ 
Sample proportion = $\hat{p}$ 


Notation for the Multiplier for a Confidence Interval 

For reasons that will become clear in later chapters, we specify the level of confidence 
for a confidence interval as $(1 - \alpha) \times 100\%$, where $\alpha$ is read “alpha.” For example, for 
a 95% confidence interval, $\alpha = .05$. Let $z_{\alpha/2}$ = standardized normal score with area 
$\alpha/2$ above it. Then the area between $-z_{\alpha/2}$ and $z_{\alpha/2}$ is $1 - \alpha$. For example, when 
$\alpha = .05$, as for a 95% confidence interval, $z_{\alpha/2} = 1.96$, or about 2. 


382 PART 4 Making Judgments from Surveys and Experiments 


Formula for a $(1 - \alpha)100\%$ Confidence Interval for a Proportion 


$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$


Common Values of $z_{\alpha/2}$ 


1.0 for a 68% confidence interval 
1.645 for a 90% confidence interval 
1.96 or 2.0 for a 95% confidence interval 
2.576 for a 99% confidence interval 
3.0 for a 99.7% confidence interval 


Exercises 

Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. An advertisement for Seldane-D, a drug prescribed for seasonal allergic rhinitis, 
reported results of a double-blind study in which 374 patients took Seldane-D 
and 193 took a placebo (Time, 27 March 1995, p. 18). Headaches were reported 
as a side effect by 65 of those taking Seldane-D. 


a. What is the sample proportion of Seldane-D takers who reported headaches? 

b. What is the standard deviation for the proportion computed in part a? 

c. Construct a 95% confidence interval for the population proportion based on 
the information from parts a and b. 

d. Interpret the confidence interval from part c by writing a few sentences 
explaining what it means. 


*2. Refer to Exercise 1. Of the 193 placebo takers, 43 reported headaches. 


*a. Compute a 95% confidence interval for the true population proportion that 
would get headaches after taking a placebo. 

b. Notice that a higher proportion of placebo takers than Seldane-D takers 
reported headaches. Use that information to explain why it is important to 
have a group taking placebos when studying the potential side effects of 
medications. 


3. On September 10, 1998, the “Starr Report,” alleging impeachable offenses by 
President Bill Clinton, was released to Congress. That evening, the Gallup Or- 
ganization conducted a poll of 645 adults nationwide to assess initial reaction 
(reported at www.gallup.com). One of the questions asked was: “Based on what 
you know at this point, do you think that Bill Clinton should or should not be 
impeached and removed from office?” The response “Yes, should” was selected 
by 31% of the respondents. 


a. The Gallup Web page said, “For results based on the total sample of adults 
nationwide, one can say with 95% confidence that the margin of sampling 
error is no greater than ± 4 percentage points.” Explain what this means and 
verify that the statement is accurate. 

b. Give a 95% confidence interval for the proportion of all adults who would 
have said President Clinton should be impeached had they been asked that 
evening. 

c. A similar Gallup Poll taken in early June 1998 found that 19% responded 
that President Clinton should be impeached. Do you think the difference 
between the results of the two polls can be attributed to chance variation in the 
samples taken, or does it represent a real difference of opinion in the 
population in June versus mid-September? Explain. 


4. A telephone poll reported in Time magazine (6 February 1995, p. 24) asked 359 
adult Americans the question, “Do you think Congress should maintain or 
repeal last year’s ban on several types of assault weapons?” Seventy-five percent 
responded “maintain.” 

a. Compute the standard deviation for the sample proportion of .75. 

b. Time reported that the “sampling error is ± 4.5%.” Verify that 4.5% is 
approximately what would be added and subtracted to the sample proportion to 
create a 95% confidence interval. 

c. Use the information reported by Time to create a 95% confidence interval for 
the population proportion. Interpret the interval in words that would be un- 
derstood by someone with no training in statistics. Be sure to specify the 
population to which it applies. 

*5. What level of confidence would accompany each of the following intervals? 

*a. Sample proportion ± 1.0(SD) 
*b. Sample proportion ± 1.645(SD) 
c. Sample proportion ± 1.96(SD) 
d. Sample proportion ± 2.576(SD) 


6. Explain whether the width of a confidence interval would increase, decrease, or 
remain the same as a result of each of the following changes: 

a. The sample size is doubled, from 400 to 800. 

b. The population size is doubled, from 25 million to 50 million. 
c. The level of confidence is lowered from 95% to 90%. 


7. Parade Magazine reported that “nearly 3200 readers dialed a 900 number to 
respond to a survey in our Jan. 8 cover story on America’s young people and vio- 
lence” (19 February 1995, p. 20). Of those responding, “63.3% say they have 
been victims or personally know a victim of violent crime.” Can the results 
quoted and methods in this chapter legitimately be used to compute a 95% con- 
fidence interval for the proportion of Americans who fit that description? If so, 
compute the interval. If not, explain why not. 


8. Refer to Example 6 in this chapter. It is claimed that a 95% confidence interval 
for the percentage of placebo-patch users who quit smoking by the eighth week 
covers 13% to 27%. There were 120 placebo-patch users, and 24 quit smoking 
by the eighth week. Verify that the confidence interval given is correct. 


9. Find the results of a poll reported in a weekly newsmagazine such as Newsweek 
or Time, in a newspaper such as the New York Times, or on the Internet in which 
a margin of error is also reported. Explain what question was asked and what 
margin of error was reported; then present a 95% confidence interval for the 
results. Explain in words what the interval means for your example. 


10. Confirm that the standard deviation for sample proportions is largest when the 
proportion used to calculate it is .50. Do this by using other values above and 
below .50 and comparing the answers to what you would get using .50. Try three 
values above and three values below .50. 


11. A university is contemplating switching from the quarter system to the semester 
system. The administration conducts a survey of a random sample of 400 
students and finds that 240 of them prefer to remain on the quarter system. 


a. Construct a 95% confidence interval for the true proportion of all students 
who would prefer to remain on the quarter system. 


b. Does the interval you computed in part a provide convincing evidence that 
the majority of students prefer to remain on the quarter system? Explain. 


c. Now suppose that only 50 students had been surveyed and that 30 said they 
preferred the quarter system. Compute a 95% confidence interval for the true 
proportion who prefer to remain on the quarter system. Does the interval pro- 
vide convincing evidence that the majority of students prefer to remain on 
the quarter system? 


d. Compare the sample proportions and the confidence intervals found in parts 
a and c. Use these results to discuss the role sample size plays in helping 
make decisions from sample data. 


12. In a special double issue of Time magazine, the cover story featured Pope John 
Paul II as “man of the year” (26 December 1994-2 January 1995, pp. 74-76). 
As part of the story, Time reported on the results of a survey of 507 adult 
American Catholics, taken by telephone on December 7-8. It was also reported that 
“sampling error is ± 4.4%.” 

a. One question asked was, “Do you favor allowing women to be priests?” to 
which 59% of the respondents answered yes. Using the reported margin of 
error of 4.4%, calculate a 95% confidence interval for the response to this 
question. Write a sentence interpreting the interval that could be understood 
by someone who knows nothing about statistics. Be careful about specifying 
the correct population. 


b. Calculate a 95% confidence interval for the question in part a, using the for- 
mula in this chapter rather than the reported margin of error. Compare your 
answer to the answer in part a. 


c. Another question in the survey was, “Is it possible to disagree with the Pope 
and still be a good Catholic?” to which 89% of respondents said yes. Using 
the formula in this chapter, compute a 95% confidence interval for the true 
proportion who would answer yes to the question. Now compute a 95% con- 
fidence interval using the reported margin of error of 4.4%. Compare your 
two intervals. 


d. If you computed your intervals correctly, you would have found that the two 
intervals in parts a and b were quite similar to each other, whereas the two 
intervals in part c were not. In part c, the interval computed using the 
reported margin of error was wider than the one computed using the formula. 
Explain why the two methods for computing the intervals agreed more 
closely for the survey question in parts a and b than for the survey question 
in part c. 


13. U.S. News and World Report (19 December 1994, pp. 62-71) reported on a 
survey of 1000 American adults, conducted by telephone on December 2-4, 1994, 
designed to measure beliefs about apocalyptic predictions. They reported that 
the margin of error was “±3 percentage points.” 


a. Verify that the margin of error for a sample of size 1000 is as reported. 


b. One of the results reported was that 59% of Americans believe the world will 
come to an end. Construct a 95% confidence interval for the true percentage 
of Americans with that belief, using the margin of error given in the article. 
Interpret the interval in a way that could be understood by a statistically 
naive reader. 


14. Refer to the article discussed in Exercise 13. The article continued by reporting 
that of those who do believe the world will come to an end, 33% believe it will 
happen within either a few years or a few decades. Respondents were only asked 
that question if they answered yes to the question about the world coming to an 
end, so about 590 respondents would have been asked the question. 


a. Consider only those adult Americans who believe the world will come to an 
end. For that population, compute a 95% confidence interval for the propor- 
tion who believe it will come to an end within the next few years or few 
decades. 


b. Explain why you could not use the margin of error of ±3% reported in the 
article to compute the confidence interval in part a. 


*15. A study first reported in the Journal of the American Medical Association 
(7 December 1994) received widespread attention as the first wide-scale study of the 
use of alcohol on American college campuses and was the subject of an article 
in Time magazine (19 December 1994, p. 16). The researchers surveyed 17,592 
students at 140 four-year colleges in 40 states. One of the results they found was 
that about 8.8%, or about 1550 respondents, were frequent binge drinkers. They 
defined frequent binge drinking as having had at least four (for women) or five 
(for men) drinks at a single sitting at least three times during the previous 2 
weeks. 


*a. Time magazine (19 December 1994, p. 66) reported that of the approximately 
1550 frequent binge drinkers in this study, 22% reported having had 
unprotected sex. Find a 95% confidence interval for the true proportion of all 
frequent binge drinkers who had unprotected sex, and interpret the interval 
for someone who has no knowledge of statistics. 


*b. Notice that the results quoted in part a indicate that about 341 students out of 
the 17,592 interviewed said they were frequent binge drinkers and had unpro- 
tected sex. Compute a 95% confidence interval for the proportion of college 
students who are frequent binge drinkers and who also had unprotected sex. 


c. Using the results from parts a and b, write two short news articles on the 
problem of binge drinking and unprotected sex. In one, make the situation 
sound as disastrous as you can. In the other, try to minimize the problem. 


*16. In Example 5 in this chapter, we found a 95% confidence interval for the 
proportion of successes likely in a certain kind of ESP test. Construct a 99.7% con- 
fidence interval for that example. Explain why a skeptic of ESP would prefer to 
report the 99.7% confidence interval. 


17. Refer to the formula for a confidence interval in the For Those Who Like 
Formulas section. 


a. Write the formula for a 90% confidence interval for a proportion. 


b. Refer to Example 6. Construct a 90% confidence interval for the proportion 
of smokers who would quit after 8 weeks using a nicotine patch. 


c. Compare the 90% confidence interval you found in part b to the 95% confi- 
dence interval used in the example. Explain which one you would present if 
your company were advertising the effectiveness of nicotine patches. 


18. In a poll reported in Newsweek (16 May 1994, p. 23) one of the questions asked 
was, “Is the media paying too much attention to [President] Clinton’s private 
life, too little, or about the right amount of attention?” Results showed that 59% 
answered “too much,” 5% answered “too little,” and 31% answered “right 
amount.” The article also reported that the poll was based on 518 adults and that 
the margin of error was 5%. 


a. Using the margin of error reported in the article, find a 95% confidence in- 
terval for the proportion of all adults who thought at the time that the media 
was paying too much attention. 


b. Based on the result in part a, could you conclude that a majority of adults at 
the time thought the media was paying too much attention to Clinton’s pri- 
vate life? Explain. 


19. Refer to News Story 2 in the Appendix and on the CD, “Research shows women 
harder hit by hangovers,” and the accompanying Original Source 2 on the CD, 
“Development and initial validation of the Hangover Symptoms Scale: Preva- 
lence and correlates of hangover symptoms in college students.” Table 3 of the 
journal article reports that 13% of the 1216 college students in the study said 
that they had not experienced any hangover symptoms in the past year. 


a. Assuming that the participants in this study are a representative sample of 
college students, find a 95% confidence interval for the proportion of college 
students who have not experienced any hangover symptoms in the past year. 
Use the formula in this chapter. 


b. Write a sentence or two interpreting the interval you found in part a that 
would be understood by someone without training in statistics. 


c. The journal article also reported that the study originally had 1474 partici- 
pants, but only 1234 reported drinking any alcohol in the past year. (Only 
those who reported drinking in the past year were retained for the hangover 
symptom questions.) Use this information to find a 95% confidence interval 
for the proportion of all students who would report drinking any alcohol in 
the past year. 


d. Refer to the journal article to determine how students were selected for this 
study. Based on that information, to what population of students do you think 
the intervals in this exercise apply? Explain. 


20. Refer to News Story 13 and the accompanying report on the CD, “2003 CASA 
National Survey of American Attitudes on Substance Abuse VIII: Teens and 
Parents.” 


a. The margins of error for the teens and for the parents are reported in the news 
story. What are they reported to be? 


b. Refer to page 30 of Original Source 13 report. The margins of error are re- 
ported there as well. What are they reported to be? Are they the same as those 
reported in the news story? Explain. 


*c. The 1987 teens in the survey were asked, “How harmful to the health of 
someone your age is the regular use of alcohol—very harmful, fairly harm- 
ful, not too harmful or not harmful at all?” Forty-nine percent responded that 
it was very harmful. Find a 95% confidence interval for the proportion of all 
teens who would respond that way. 


d. The 504 parents in the survey were asked, “How harmful to the health of a 
teenager is the regular use of alcohol—very harmful, fairly harmful, not too 
harmful or not harmful at all?” Seventy-seven percent responded that it was 
very harmful. Find a 95% confidence interval for the proportion of all par- 
ents (similar to those in this study) who would respond that way. 


e. Compare the confidence intervals in parts c and d. In particular, do they in- 
dicate that there is a difference in the population proportions of teens and 
parents who think alcohol is very harmful to teens? 


f. Write a short news story reporting the results you found in parts c to e. 


Mini-Projects 


1. You are going to use the methods discussed in this chapter to estimate the 
proportion of all cars in your area that are red. Stand on a busy street and count cars 
as they pass by. Count 100 cars and keep track of how many are red. 

a. Using your data, compute a 95% confidence interval for the proportion of 
cars in your area that are red. 

b. Based on how you conducted the survey, are any biases likely to influence 
your results? Explain. 


2. Collect data and construct a confidence interval for a proportion for which you 
already know the answer. Use a sample of at least 100. You can select the 
situation for which you would like to do this. For example, you could flip a coin 100 
times and construct a confidence interval for the proportion of heads, knowing 
that the true proportion is .5. Report how you collected the data and the results 
you found. Explain the meaning of your confidence interval and compare it to 
what you know to be the truth about the proportion of interest. 


3. Choose a categorical variable for which you would like to estimate the true pro- 
portion that fall into a certain category. Conduct an experiment or a survey that 
allows you to find a 95% confidence interval for the proportion of interest. Ex- 
plain exactly what you did, how you computed your results, and how you would 
interpret the results. 
The Role of Confidence Intervals in Research 


Thought Questions 


1. In this chapter, Example 1 compares weight loss (over 1 year) in men who diet but do 
not exercise, and vice versa. The results show that a 95% confidence interval for the 
mean weight loss for men who diet but do not exercise extends from 13.4 to 18.3 
pounds. A 95% confidence interval for the mean weight loss for men who exercise 
but do not diet extends from 6.4 to 11.2 pounds. 


a. Do you think this means that 95% of all men who diet will lose between 13.4 and 
18.3 pounds? Explain. 


b. On the basis of these results, do you think you can conclude that men who diet 
without exercising lose more weight, on average, than men who exercise but do 
not diet? 


2. The first confidence interval in Question 1 was based on results from 42 men. The 
confidence interval spans a range of almost 5 pounds. If the results had been based 
on a much larger sample, do you think the confidence interval for the mean weight 
loss would have been wider, narrower, or about the same? Explain your reasoning. 


3. In Question 1, we compared average weight loss for dieting and for exercising by 
computing separate confidence intervals for the two means and comparing the 
intervals. What would be a more direct value to examine to make the comparison 
between the mean weight loss for the two methods? 


4. In Case Study 5.3, we examined the relationship between baldness and heart attacks. 
Many of the results reported in the original journal article were expressed in terms of 
relative risk of a heart attack for men with severe vertex baldness compared to men 
with no hair loss. One result reported was that a 95% confidence interval for the rel- 
ative risk for men under 45 years of age extended from 1.1 to 8.2. 


a. Recalling the material from Chapter 12, explain what it means to have a relative 
risk of 1.1 in this example. 


b. Interpret the result given by the confidence interval. 
21.1 Confidence Intervals for Population Means 


In Chapter 19, we learned what to expect of sample means, assuming we knew the 
mean and the standard deviation of the population from which the sample was 
drawn. In this section, we try to estimate a population mean when all we have 
available is a sample of measurements from the population. All we need from the sample 
are its mean, standard deviation, and number of observations. 


EXAMPLE 1 

Do Men Lose More Weight by Diet or by Exercise? 


Wood and colleagues (1988), also reported by Iman (1994, p. 258), studied a group of 
89 sedentary men for a year. Forty-two men were placed on a diet; the remaining 47 
were put on an exercise routine. 

The group on a diet lost an average of 7.2 kg, with a standard deviation of 3.7 kg. 
The men who exercised lost an average of 4.0 kg, with a standard deviation of 3.9 kg 
(Wood et al., 1988, Table 2). Before we discuss how to compare the groups, let’s deter- 
mine how to extend the sample results to what would happen if the entire population 
of men of this type were to diet or exercise exclusively. We will return to this example 
after we learn the general method. 


The Rule for Sample Means, Revisited 


In Chapter 19, we learned how sample means behave. 


The Rule for Sample Means is 
If numerous samples of the same size are taken, the frequency curve of means 
from the various samples will be approximately bell-shaped. The mean of this 
collection of sample means will be the same as the mean of the population. 
The standard deviation will be 


population standard deviation/square root of sample size 


Standard Error of the Mean 

Before proceeding, we need to distinguish between the population standard deviation 
and the standard deviation for the sample means, which is the population standard 
deviation/√n. (Recall that n is the number of observations in the sample.) 
Consistent with the distinction made by most researchers, we use terminology as fol- 
lows for these two different kinds of standard deviations. 


The standard deviation for the possible sample means is called the standard 
error of the mean. It is sometimes abbreviated as SEM or just “standard er- 
ror.” In other words, 


SEM = standard error = population standard deviation/√n 


In practice, the population standard deviation is usually unknown and is replaced 
by the sample standard deviation, computed from the data. The term standard error 
of the mean or standard error is still used. 


Population versus Sample Standard Deviation and Error 

An example will help clarify the distinctions among these terms. In Chapter 19, we 
considered a hypothetical population of people who visited a weight-loss clinic. We 
said that the weight losses for the thousands of people in the population were bell- 
shaped, with a mean of 8 pounds and a standard deviation of 5 pounds. Further, we 
considered samples of n = 25 people. For one sample, we found the mean and stan- 
dard deviation for the 25 people to be mean = 8.32 pounds, standard deviation = 
4.74 pounds. 

Thus, we have the following numbers: 


population standard deviation = 5 pounds 

sample standard deviation = 4.74 pounds 

standard error of the mean (using population SD) = 5/√25 = 1 
standard error of the mean (using sample SD) = 4.74/√25 = 0.95 
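The same arithmetic can be written as a short Python sketch. The function name is an illustrative choice; the numbers are the ones from the weight-loss clinic example.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: a standard deviation divided by sqrt(n)."""
    return sd / math.sqrt(n)

n = 25
sem_population = standard_error(5.0, n)   # uses the population SD
sem_sample = standard_error(4.74, n)      # uses the sample SD

print(round(sem_population, 2))  # 1.0
print(round(sem_sample, 2))      # 0.95
```

Only the standard deviation changes between the two calls; the sample size in the denominator is the same.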


Let’s now return to our discussion of the Rule for Sample Means. It is important to 
remember the conditions under which this rule applies: 


1. The population of measurements of interest is bell-shaped, and a random sample 
of any size is measured. 
or 


2. The population of measurements of interest is not bell-shaped, but a large ran- 
dom sample is measured. A sample of size 30 is usually considered “large,” but 
if there are extreme outliers, it is better to have a larger sample. 


Constructing a Confidence Interval for a Mean 


We can use the same reasoning we used in Chapter 20, where we constructed a 95% 
confidence interval for a proportion, to construct a 95% confidence interval for a 
mean. The Rule for Sample Means and the Empirical Rule from Chapter 8 allow us 
to make the following statement: 


In 95% of all samples, the sample mean will fall within 2 standard errors of 
the true population mean. 


Now let’s rewrite the statement in a more useful form: 


In 95% of all samples, the true population mean will be within 2 standard er- 
rors of the sample mean. 


In other words, if we simply add and subtract 2 standard errors to the sample mean, 
in 95% of all cases we will have captured the true population mean. 
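A simulation makes the "95% of all cases" claim concrete. This is a sketch under stated assumptions: a bell-shaped population with mean 8 and standard deviation 5, samples of size 50, the sample standard deviation used in the standard error, and an arbitrary seed. None of these values come from a real study.

```python
import math
import random
import statistics

def coverage_rate(pop_mean, pop_sd, n, repetitions, seed=2):
    """Fraction of samples whose interval (mean +/- 2 SE) contains pop_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        mean = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        if mean - 2 * se <= pop_mean <= mean + 2 * se:
            hits += 1
    return hits / repetitions

rate = coverage_rate(8, 5, 50, 5_000)
print(f"fraction of intervals that captured the true mean: {rate:.3f}")
```

The printed fraction should land close to 0.95. In a real study only one such interval is computed, and there is no way to tell whether it is one of the successes.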
Putting this all together, here is the formula for a 95% confidence interval for 
a population mean: 


sample mean ± 2 standard errors 
where 


standard error = standard deviation/√n 


Important note: This formula should be used only if there are at least 30 observations 
in the sample. To compute a 95% confidence interval for the population mean 
based on smaller samples, a multiplier larger than 2 is used, which is found from a 
“t-distribution.” The technical details involved are beyond the scope of this book. 
However, if someone else has constructed the confidence interval for you based on 
a small sample, the interpretation discussed here is still valid. 


EXAMPLE 1 CONTINUED 

Comparing Diet and Exercise for Weight Loss 


Let's construct a 95% confidence interval for the mean weight losses for all men who 
might diet or who might exercise, based on the sample information given in Example 1. 
We will be constructing two separate confidence intervals, one for each condition. Notice 
the switch from kilograms to pounds at the end of the computation; 1 kg = 2.2 lb. The 
results could be expressed in either unit, but pounds are more familiar to many readers. 


                              Diet Only                Exercise Only 

sample mean                   7.2 kg                   4.0 kg 
sample standard deviation     3.7 kg                   3.9 kg 
number of participants (n)    42                       47 
standard error                3.7/√42 = 0.571          3.9/√47 = 0.569 
2 × standard error            2(0.571) = 1.1           2(0.569) = 1.1 


95% confidence interval for the population mean: 
sample mean ± 2 × standard error 


Diet Only                     Exercise Only 

7.2 ± 1.1                     4.0 ± 1.1 
6.1 kg to 8.3 kg              2.9 kg to 5.1 kg 
13.4 lb to 18.3 lb            6.4 lb to 11.2 lb 


These results indicate that men similar to those in this study would lose an average 
of somewhere between 13.4 and 18.3 pounds on a diet but would lose only an aver- 
age of 6.4 to 11.2 pounds with exercise alone. Notice that these intervals are trying to 
capture the true mean or average value for the population. They do not encompass the 
full range of weight loss that would be experienced by most individuals. Also, remem- 
ber that these intervals could be wrong. Ninety-five percent of intervals constructed this 
way will contain the correct population mean value, but 5% will not. We will never 
know which are which. 
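The two intervals can be reproduced with a few lines of Python. This is a sketch of the computation, not the researchers' own code; the function name is illustrative, and the endpoints are rounded to one decimal in kilograms before converting to pounds, as in the example.

```python
import math

KG_TO_LB = 2.2  # conversion used in the example: 1 kg is about 2.2 lb

def mean_ci_95(sample_mean, sample_sd, n):
    """Approximate 95% CI for a population mean (intended for n >= 30)."""
    se = sample_sd / math.sqrt(n)
    return sample_mean - 2 * se, sample_mean + 2 * se

diet = mean_ci_95(7.2, 3.7, 42)
exercise = mean_ci_95(4.0, 3.9, 47)

for label, (low, high) in [("diet only", diet), ("exercise only", exercise)]:
    low_kg, high_kg = round(low, 1), round(high, 1)
    print(f"{label}: {low_kg} kg to {high_kg} kg "
          f"({low_kg * KG_TO_LB:.1f} lb to {high_kg * KG_TO_LB:.1f} lb)")
```

Because the two intervals are computed separately, the code says nothing yet about the size of the difference between the two population means.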

Based on these results, it appears that dieting probably results in a larger weight loss 
than exercise because there is no overlap in the two intervals. Comparing the endpoints 
of these intervals, we are fairly certain that the average weight loss from dieting is no 
lower than 13.4 pounds and the average weight loss from exercising no higher than 
11.2 pounds. In the next section, we learn a more efficient method for making the 
comparison, one that will enable us to estimate with 95% confidence the actual difference 
in the two averages for the population. 


21.2 Confidence Intervals for the Difference 
Between Two Means 


In many instances, such as in the preceding example, we are interested in compar- 
ing the population means under two conditions or for two groups. One way to do that 
is to construct separate confidence intervals for the two conditions and then compare 
them. That’s what we did in Section 21.1 for the weight-loss example. A more direct 
and efficient approach is to construct a single confidence interval for the difference 
in the population means for the two groups or conditions. In this section, we learn 
how to do that. 

You may have noticed that the formats are similar for the two types of confidence 
intervals we have discussed so far. That is, they were both used to estimate a popu- 
lation value, either a proportion or a mean. They were both built around the corre- 
sponding sample value, the sample proportion or the sample mean. They both had 
the form: 


sample value ± 2 × measure of variability 


This format was based on the fact that the “sample value” over repeated samples was 
predicted to follow a bell-shaped curve centered on the “population value.” All we 
needed to know in addition was the “standard deviation” for that specific bell-shaped 
curve. The Empirical Rule from Chapter 8 tells us that an interval spanning 2 stan- 
dard deviations on either side of the center will cover 95% of the possible values. 

The same is true for calculating a 95% confidence interval for the difference in 
two means. Here is the recipe you follow: 


Constructing a 95% Confidence Interval for the Difference in Means 


1. Collect a large sample of observations (at least 30), independently, under 
each condition or from each group. Compute the mean and the standard 
deviation for each sample. 


2. Compute the standard error of the mean (SEM) for each sample by divid- 
ing the sample standard deviation by the square root of the sample size. 

3. Square the two SEMs and add them together. Then take the square root. 
This will give you the necessary “measure of variability,” which is called 
the standard error of the difference in two means. In other words, 


measure of variability = square root of [(SEM₁)² + (SEM₂)²] 


4. A 95% confidence interval for the difference in the two population means is 
difference in sample means ± 2 × measure of variability 
or 


difference in sample means ± 2 × square root of [(SEM₁)² + (SEM₂)²] 


EXAMPLE 2 

A Direct Comparison of Diet and Exercise 


We are now in a position to compute a 95% confidence interval for the difference in 
population means for weight loss from dieting only and weight loss from exercising only. 
Let's follow the steps outlined, using the data from the previous section. 


Steps 1 and 2. Compute sample means, standard deviations, and SEMs: 


                              Diet Only                  Exercise Only 

sample mean                   7.2 kg                     4.0 kg 
sample standard deviation     3.7 kg                     3.9 kg 
number of participants (n)    42                         47 
standard error                SEM₁ = 3.7/√42 = 0.571     SEM₂ = 3.9/√47 = 0.569 


Step 3. Square the two standard errors and add them together. Take the square root: 
measure of uncertainty = square root of [(0.571)² + (0.569)²] = 0.81 


Step 4. Compute the interval. A 95% confidence interval for the difference in the two 
population means is 


difference in sample means ± 2 × measure of variability 
(7.2 − 4.0) ± 2(0.81) 
3.2 ± 1.6 
1.6 kg to 4.8 kg 
3.5 lb to 10.6 lb 


Notice that this interval is entirely above zero. Therefore, we can be highly confident that 
there really is a difference in average weight loss, with higher weight loss for dieting 
alone than for exercise alone. In other words, we are 95% confident that the interval 
captures the true difference in means and that the true difference is at least 3.5 pounds. 
Remember that this interval estimates the difference in averages, not in weight losses 
for individuals. 
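The four steps translate directly into a small function. The sketch below is one way to write them in Python, assuming large, independent samples; the function and variable names are illustrative.

```python
import math

def diff_ci_95(mean1, sd1, n1, mean2, sd2, n2):
    """Approximate 95% CI for (population mean 1 - population mean 2).

    Assumes large (n >= 30), independent samples.
    Returns (low, high, standard error of the difference).
    """
    sem1 = sd1 / math.sqrt(n1)                # Step 2, first sample
    sem2 = sd2 / math.sqrt(n2)                # Step 2, second sample
    se_diff = math.sqrt(sem1**2 + sem2**2)    # Step 3
    diff = mean1 - mean2
    return diff - 2 * se_diff, diff + 2 * se_diff, se_diff   # Step 4

# Diet only versus exercise only, in kilograms.
low, high, se_diff = diff_ci_95(7.2, 3.7, 42, 4.0, 3.9, 47)

print(f"standard error of the difference: {se_diff:.2f}")   # 0.81
print(f"95% CI: {low:.1f} kg to {high:.1f} kg")             # 1.6 kg to 4.8 kg
```

Swapping the order of the two groups flips the signs of both endpoints but does not change the conclusion.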


A Caution about Using This Method 


The method described in this section is valid only when independent measurements 
are taken from the two groups. For instance, if matched pairs are used and one treat- 
ment is randomly assigned to each half of the pair, the measurements would not be 
independent. In that case, differences should be taken for each pair of measurements, 
and then a confidence interval computed for the mean of those differences. Assuming 
the matched observations were positively correlated, using the method in this 
section would result in a measure of variability that was too large. 
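To see why, consider the sketch below. The data are entirely hypothetical: 40 matched pairs are simulated so that both members of a pair share a common baseline, which makes their measurements positively correlated. The seed, baseline values, and treatment effect are arbitrary assumptions chosen only for illustration.

```python
import math
import random
import statistics

rng = random.Random(3)  # arbitrary seed so the run is repeatable

# Hypothetical matched pairs: a shared pair-level baseline makes the two
# measurements in each pair positively correlated.
pairs = []
for _ in range(40):
    baseline = rng.gauss(80, 10)
    a = baseline + rng.gauss(0, 1)       # member given treatment A
    b = baseline + 1 + rng.gauss(0, 1)   # member given treatment B
    pairs.append((a, b))

a_values = [a for a, _ in pairs]
b_values = [b for _, b in pairs]
n = len(pairs)

# Method of this section, which wrongly treats the groups as independent.
sem_a = statistics.stdev(a_values) / math.sqrt(n)
sem_b = statistics.stdev(b_values) / math.sqrt(n)
se_independent = math.sqrt(sem_a**2 + sem_b**2)

# Paired approach: one difference per pair, then the SEM of those differences.
differences = [b - a for a, b in pairs]
se_paired = statistics.stdev(differences) / math.sqrt(n)

print(f"independent-samples measure of variability: {se_independent:.2f}")
print(f"paired-differences standard error:          {se_paired:.2f}")
```

With data like these, the independent-samples figure is much larger than the paired one, so an interval built from it would be needlessly wide.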


21.3 Revisiting Case Studies: How Journals 
Present Confidence Intervals 


Many of the case studies we examined in the early part of this book involved mak- 
ing conclusions about differences in means. The original journal articles from which 
those case studies were drawn each presented results in a slightly different way. 
Some provided confidence intervals directly; others gave the information necessary 
for you to construct your own interval. In this section, we revisit some of those case 
studies as examples of the kinds of information researchers provide for readers. 


Direct Reporting of Confidence Intervals: Case Study 6.4 


Case Study 6.4 examined the relationship between smoking during pregnancy and 
subsequent IQ of the child. The journal article in which that study was reported 
(Olds, Henderson, and Tatelbaum, 1994) provided 95% confidence intervals to ac- 
company all the results it reported. Most of the confidence intervals were based on 
a comparison of the means for mothers who didn’t smoke and mothers who smoked 
10 or more cigarettes per day, hereafter called “smokers.” Some of the results were 
presented in tables and others were presented in the text. Table 21.1 gives some of 
the results from the tables contained in the paper. 

Let’s interpret these results. The first confidence interval compares the mean ed- 
ucational levels for the smokers and nonsmokers. The result tells us that, in this sam- 
ple, the average educational level for nonsmokers was 0.67 year higher than for 
smokers. The confidence interval extends this value to what the difference might be 
for the populations from which these samples were drawn. The interval tells us that 
the difference in the population is probably between 0.15 and 1.19 years of educa- 
tion. In other words, mothers who did not smoke were also likely to have had more 
education. Maternal education was a confounding variable in this study and was part 
of what the researchers used to try to explain the differences observed in the children’s 
IQs for the two groups. 


Table 21.1 Some 95% Confidence Intervals from Case Study 6.4 

                                      Sample Means 
                              0 Cigarettes   10+ Cigarettes   Difference (95% CI) 

Maternal education, grades    11.57          10.89            0.67 (0.15, 1.19) 
Stanford-Binet (IQ), 48 mo    113.28         103.12           10.16 (5.04, 15.30) 
Birthweight, g                3416           3035             381.0 (167.1, 594.9) 

Source: “Study: Smoking May Lower Kids’ IQs.” Associated Press, February 11, 1994. Reprinted with permission. 

The second row of Table 21.1 compares the mean IQs for the children of the non- 
smokers and smokers at 48 months of age. The difference in means for the sample 
was 10.16 points. From this, the researchers inferred that there is probably a differ- 
ence of somewhere between 5.04 and 15.30 points for the entire population. In other 
words, the children of nonsmokers in the population probably have IQs that are be- 
tween 5.04 and 15.30 points higher than the children of mothers who smoke 10 or 
more cigarettes per day. 

The third row of Table 21.1 (birthweight) represents an example of the kind of 
explanatory confounding variables that may have been present. Smoking may have 
caused lower birthweights, which in turn may have caused lower IQs. The result 
shown here is that the average difference in birthweight for babies of nonsmokers 
and smokers in the sample was 381 grams. Further, with 95% confidence, we can 
state that there could be a difference as low as 167.1 grams or as high as 594.9 grams 
for the population from which this sample was drawn. Thus, we are fairly certain that 
mothers in the population who smoke are likely to have babies with lower birth- 
weight, on average, than those who don’t. 

Olds and colleagues (1994) also included numerous confidence intervals in the 
text. For example, they realized that they needed to control for confounding vari- 
ables to make a more realistic comparison between the IQs of children of smokers 
and nonsmokers. They wrote: “After control for confounding background variables 
(Table 3), the average difference observed at 12 and 24 months was 2.59 points (95% 
CI: −3.03, 8.20); the difference observed at 36 and 48 months was reduced to 4.35 
points (95% CI: 0.02, 8.68)” (pp. 223-224). 

The result, as reported in the news story quoted in Chapter 6, was that the gap in 
average IQ at 3 and 4 years of age narrowed to 4 points when “a wide range of in- 
terrelated factors were controlled.” You are now in an excellent position to under- 
stand a much broader picture than that reported to you in the newspaper. For 
example, as you can see from the reported confidence intervals, we can’t rule out the 
possibility that the differences in IQ at 1 and 2 years of age were in the other direc- 
tion because the interval covers some negative values. Further, even at 3 and 4 years 
of age, the confidence interval tells us that the gap could have been just slightly 
above zero in the population. 
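The check being applied to each interval is simply whether zero lies between its endpoints. A short Python sketch, using the intervals reported above (nonsmokers minus smokers) and an illustrative helper name, makes the comparison explicit.

```python
# Reported 95% confidence intervals for differences in means
# (nonsmokers minus smokers), copied from the passage above.
reported = {
    "maternal education (grades)": (0.15, 1.19),
    "IQ at 48 months": (5.04, 15.30),
    "IQ at 12 and 24 months, adjusted": (-3.03, 8.20),
    "IQ at 36 and 48 months, adjusted": (0.02, 8.68),
}

def covers_zero(interval):
    """True if zero lies between the two endpoints of the interval."""
    low, high = interval
    return low <= 0 <= high

for label, interval in reported.items():
    verdict = "covers zero" if covers_zero(interval) else "excludes zero"
    print(f"{label}: {interval} {verdict}")
```

Only the adjusted interval for 12 and 24 months covers zero; the adjusted interval for 36 and 48 months excludes it, but only barely.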


Reporting Standard Errors of the Mean: Case Study 6.2 


Case Study 6.2 involved a comparison in serum DHEA-S levels for practitioners and 
nonpractitioners of transcendental meditation. The results were presented as tables 
showing the mean DHEA-S level for each 5-year age group; there were separate ta- 
bles for men and women. Confidence intervals were not presented, but values were 
given for standard errors of the means (SEMs). Therefore, confidence intervals could 
be computed from the information given. 

To illustrate exactly how the results were presented, let’s examine one of the tables. 
Figure 21.1 displays the heading and a typical row from Table II of Glaser and 
colleagues (1992, p. 333). 


Figure 21.1 Part of the table from the journal article in Case Study 6.2 

Table II Serum DHEA-S Concentrations (± SEM) in 606 Women 
for Comparison and TM Groups 

              Comparison Group                  TM Group 
Age Group     N     DHEA-S Level (µg/dl)        N     DHEA-S Level (µg/dl)     % Elevation in TM Group 

45-49         51    88 ± 12                     30    117 ± 11                 34 

Source: Glaser et al., “Elevated serum dehydroepiandrosterone sulfate levels in practitioners of 
transcendental meditation (TM) and TM-Sidhi programs.” Journal of Behavioral Medicine, vol. 15, no. 4, p. 333. 


In the heading for the table, we are told that the values following the ± sign are the 
standard errors of the means. Using that information, we 
can construct 95% confidence intervals for the mean DHEA-S levels in each group, 
or we can construct a single 95% confidence interval for the difference in the means 
for the meditators and nonmeditators. Constructing the latter interval, we find that a 
95% confidence interval for the difference in means is 


difference in sample means ± 2 × square root of [(SEM₁)² + (SEM₂)²] 
(117 − 88) ± 2 × square root of [(11)² + (12)²] 
29 ± 2(16.3) 
29 ± 32.6 
−3.6 to 61.6 


This interval is probably not easy to interpret if you are not familiar with DHEA-S 
values. Because the interval includes zero, we cannot say, even with 95% confi- 
dence, that the observed difference in sample means represents a real difference in 
the population. However, because the interval extends so much further in the posi- 
tive direction than the negative, the evidence certainly suggests that population 
DHEA-S levels are higher for the meditators than for others. 
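When a journal reports standard errors directly, Steps 1 and 2 are already done and only Steps 3 and 4 remain. The following Python sketch shows this shortcut; the function name is illustrative, and the inputs are the means and SEMs from the row shown in Figure 21.1.

```python
import math

def diff_ci_from_sems(mean1, sem1, mean2, sem2):
    """Approximate 95% CI for a difference in means, given each group's SEM."""
    se_diff = math.sqrt(sem1**2 + sem2**2)
    diff = mean1 - mean2
    return diff - 2 * se_diff, diff + 2 * se_diff

# TM group (117, SEM 11) minus comparison group (88, SEM 12).
low, high = diff_ci_from_sems(117, 11, 88, 12)
print(f"{low:.1f} to {high:.1f}")   # -3.6 to 61.6
```

No sample sizes are needed here because they are already built into the reported standard errors.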


Reporting Standard Deviations: Case Study 5.1 


In Case Study 5.1, we looked at the comparison of smoking cessation rates for patients 
using nicotine patches versus those using placebo patches. The main variables of in- 
terest were categorical and thus should be studied using methods for proportions. 
However, the authors also presented numerical summaries of a variety of other char- 
acteristics of the subjects. That information is useful to make sure that the randomiza- 
tion procedure distributed those variables fairly across the two treatment conditions. 

The authors reported means, standard deviations (SD), and ranges (low to high) 
for a variety of characteristics. As an example, Figure 21.2 shows part of a table 
given by Hurt and colleagues (23 February 1994). In the table, the ages for the group 
wearing placebo patches had a mean of 43.6 years and a standard deviation of 10.6 
years, and ranged from 21 to 65 years. 


Figure 21.2 Part of a table from Case Study 5.1 

Table 1 Baseline Characteristics 

Mean ± SD (Range)              Active                   Placebo 

Age, y                         42.8 ± 11.1 (20-65)      43.6 ± 10.6 (21-65) 
Cigarettes/d (n = 119/119)*    28.8 ± 9.4 (20-60)       30.6 ± 9.4 (20-60) 

*The notation (n = 119/119) means that there were 119 people in each group for these calculations. 

Source: Hurt et al., 23 February 1994, p. 596. 


Notice that the intervals given in the table are not 95% confidence intervals, de- 
spite the fact that they are presented in the standard format used for such intervals. 
Be sure to read carefully when you examine results presented in journals so that you 
are not misled into thinking results presented in this format represent 95% confi- 
dence intervals. 

From the information presented, we notice a slight difference in the mean ages 
for each group and in the mean number of cigarettes each group smoked per day at 
the start of the study. Let’s use the information given in the table to compute a 95% 
confidence interval for the difference in number of cigarettes smoked per day, to find 
out if it represents a substantial difference in the populations represented by the two 
groups. (It should be obvious to you that if the placebo group started out smoking 
significantly more, the results of the study would be questionable.) To compute the 
confidence interval, we need the means, standard deviations, and the sample sizes 
(n) presented in the table. Here is how we would proceed with the computation. 


Steps 1 and 2. Compute sample means, standard deviations, and SEMs: 

                              Active Group                Placebo Group 

sample mean                   28.8 cigarettes per day     30.6 cigarettes per day 
sample standard deviation     9.4 cigarettes              9.4 cigarettes 
number of participants (n)    119                         119 
standard error                SEM₁ = 9.4/√119 = 0.86      SEM₂ = 9.4/√119 = 0.86 


Step 3. Square the two standard errors and add them together. Take the square root: 
measure of uncertainty = square root of [(0.86)² + (0.86)²] = 1.2 


Step 4. Compute the interval. A 95% confidence interval for the difference in the 
two population means is 
difference in sample means ± 2 × measure of variability 
(28.8 − 30.6) ± 2(1.2) 
−1.8 ± 2.4 
−4.2 to +0.60 


It appears that there could have been slightly fewer cigarettes smoked per day by 
the group that received the nicotine patches, but the interval covers zero, allowing 
for the possibility that the difference observed in the sample means was opposite in 
direction from the difference in the population means. In other words, we simply 
can’t tell if the difference observed in the sample means represents a real difference 
in the populations. 


Summary of the Variety of Information Given in Journals 

There is no standard for how journal articles present results. However, you can de- 
termine confidence intervals for individual means or for the difference in two means 
(for large, independent samples) as long as you are given one of the following sets 
of information: 

1. Direct confidence intervals 

2. Means and standard errors of the means 


3. Means, standard deviations, and sample sizes 


21.4 Understanding Any Confidence Interval 


Not all results are reported as proportions, means, or differences in means. Numer- 
ous other statistics can be computed to make comparisons, almost all of which have 
corresponding formulas for computing confidence intervals. Some of those formu- 
las are quite complex, however. 

In cases where a complicated procedure is needed to compute a confidence in- 
terval, authors of journal articles usually provide the completed intervals. Your job is 
to be able to interpret those intervals. The principles you have learned for under- 
standing confidence intervals for means and proportions are directly applicable to 
understanding any confidence interval. As an example, let’s consider the confidence 
intervals reported in another of our earlier case studies. 


Confidence Interval for Relative Risk: Case Study 5.3 


In Case Study 5.3, we investigated a study relating baldness and heart disease. The 
measure of interest in that study was the relative risk of heart disease based on de- 
gree of baldness. The investigators focused on the relative risk (RR) of myocardial 
infarction (MI)—that is, a heart attack—for men with baldness compared to men 
without any baldness. Here is how they reported some of the results: 


For mild or moderate vertex baldness, the age-adjusted RR estimates were ap- 
proximately 1.3, while for extreme baldness the estimate was 3.4 (95% CI, 1.7 
to 7.0)... . For any vertex baldness (i.e., mild, moderate, and severe com- 
bined), the age-adjusted RR was 1.4 (95% CI, 1.2 to 1.9). (Lesko et al., 1993, 
p. 1000) 


CASE STUDY 21.1 


The confidence intervals for age-adjusted relative risk are not simple to compute. 
You may notice that they are not of the form “sample value ± 2 × (measure of uncertainty),” 
evidenced by the fact that they are not symmetric about the sample values 
given. However, these intervals can be interpreted in the same way as any other 
confidence interval. For instance, with 95% certainty we can say that men with ex- 
treme baldness are at higher risk of heart attack than men with no baldness and that 
the ratio of risks is probably between 1.7 and 7.0. In other words, men with extreme 
baldness are probably anywhere from 1.7 to 7 times more likely to experience a heart 
attack than men of the same age without any baldness. Of course, these results as- 
sume that the men in the study are representative of the larger population. 


Understanding the Confidence Level 


For a 95% confidence interval, the value of 95% is called the confidence level. In 
general, the confidence level is a measure of how much confidence we have that the 
procedure used to generate the interval worked. For a confidence level of 95%, we 
expect that about 95% of all such intervals will actually cover the true population 
value. The remaining 5% will not. Any particular numerical interval computed this 
way either covers the truth or not. The trouble is, when we have an interval in hand, 
we don’t know if it’s one of the 95% “good” ones or the 5% “bad” ones. Thus, we 
say we are 95% confident that the interval captures the true population value. 

There is nothing sacred about 95%, although it is the value most commonly used. 
As noted in Chapter 20, if we changed the multiplier to 1.645 (instead of 2), we 
would construct a 90% confidence interval. To construct a 99% confidence interval, 
we would use a multiplier of 2.576. As you should be aware from the Empirical Rule 
in Chapter 8, a multiplier of 3 would produce a 99.7% confidence interval. In gen- 
eral, the larger the multiplier, the higher the level of confidence that we have cor- 
rectly captured the truth. Of course, the trade-off is that a larger multiplier produces 
a wider interval. 
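The multipliers quoted above come from the standard normal curve, and they can be recovered with Python's standard library. This is a sketch with illustrative function names. Note that the exact multiplier for 95% is about 1.96, which this book rounds to 2.

```python
from statistics import NormalDist

def multiplier_for(confidence_level):
    """z value with (1 - confidence_level)/2 of the normal curve above it."""
    upper_tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - upper_tail)

def level_for(multiplier):
    """Confidence level that corresponds to a given multiplier."""
    return 2 * NormalDist().cdf(multiplier) - 1

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: multiplier {multiplier_for(level):.3f}")

print(f"multiplier 2 corresponds to {level_for(2):.1%}")
print(f"multiplier 3 corresponds to {level_for(3):.1%}")
```

Going the other way, a multiplier of 2 corresponds to a level slightly above 95%, and a multiplier of 3 to about 99.7%, in line with the Empirical Rule.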


Premenstrual Syndrome? Try Calcium 


It was front page news in the Sacramento Bee. The headline read “Study says cal- 
cium can help ease PMS,” and the article continued “Daily doses of calcium can re- 
duce both the physical and psychological symptoms of premenstrual syndrome by at 
least half, according to new research that points toward a low-cost, simple remedy 
for a condition that affects millions of women” (Maugh, 26 August 1998). The arti- 
cle described a randomized, double-blind experiment in which women who suffered 
from premenstrual syndrome (PMS) were randomly assigned to take either a placebo 
or 1200 mg of calcium per day in the form of four Tums E-X tablets (Thys-Jacobs 
et al., 1998). Participants included 466 women with a history of PMS: 231 in the cal- 
cium treatment group and 235 in the placebo group. 

The primary measure of interest was a composite score based on 17 PMS symp- 
toms, including 6 that were mood-related, 5 involving water retention, 2 involving 
food cravings, 3 related to pain, and insomnia. Participants were asked to rate each 
of the 17 symptoms daily on a scale from 0 (absent) to 3 (severe). The actual 
“symptom complex score” was the mean rating for the 17 symptoms. Thus, a score of 0 
would imply that all symptoms were absent, and a score of 3 would indicate that all 
symptoms were severe. The original article (Thys-Jacobs et al., 1998) presents re- 
sults individually for each of the 17 symptoms plus the composite score. 

One interesting outcome of this study was that the severity of symptoms was 
substantially reduced for both the placebo and the calcium-treated groups. There- 
fore, comparisons should be made between those two groups rather than examining 
the reduction in scores before and after taking calcium for the treatment group alone. 
In other words, part of the total reduction in symptoms for the calcium-treated group 
could be the result of a “placebo effect.” We are interested in knowing the additional 
influence of taking calcium. 

Let’s compare the severity of symptoms as measured by the composite score for 
the placebo and calcium-treated groups. The treatments were continued for three 
menstrual cycles; we report the symptom scores for the premenstrual period (7 days) 
before treatments began (baseline) and before the third cycle. 

Table 21.2 presents results as given in the journal article, including sample sizes 
and the mean symptom complex scores ± 1 standard deviation. Notice that sample 
sizes were slightly reduced by the third cycle due to patients dropping out of the 
study. Let’s use the results in the table to compute a confidence interval for what the 
mean differences would be for the entire population of PMS sufferers. 

The purpose of the experiment is to see if taking calcium diminishes symptom 
severity. Because we know that placebos alone can be responsible for reducing 
symptoms, the appropriate comparison is between the placebo and calcium-treated 
groups rather than between the baseline and third cycle symptoms for the calcium- 
treated group alone. 

The difference in means (placebo − calcium) for the third cycle is (0.60 − 0.43) = 
0.17. The “measure of uncertainty” is about 0.039, so a 95% confidence interval for 
the difference is 0.17 ± 2(0.039), or about 0.09 to 0.25. 
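Applying the four steps of Section 21.2 to the third-cycle row of Table 21.2 gives the sketch below. Computed this way, without intermediate rounding, the measure of uncertainty comes out near 0.044 rather than the 0.039 quoted above, and the interval is about 0.08 to 0.26. The text does not say how 0.039 was obtained, so treat the code as one reading of the table rather than as the authors' calculation. Either way, the interval lies entirely above zero.

```python
import math

# Third-cycle values as given in Table 21.2 (mean, SD, n).
placebo_mean, placebo_sd, placebo_n = 0.60, 0.52, 228
calcium_mean, calcium_sd, calcium_n = 0.43, 0.40, 212

# Steps 1 and 2: standard error of the mean for each group.
sem_placebo = placebo_sd / math.sqrt(placebo_n)
sem_calcium = calcium_sd / math.sqrt(calcium_n)

# Step 3: standard error of the difference in the two means.
se_diff = math.sqrt(sem_placebo**2 + sem_calcium**2)

# Step 4: difference in sample means plus and minus 2 standard errors.
diff = placebo_mean - calcium_mean
low, high = diff - 2 * se_diff, diff + 2 * se_diff

print(f"difference in means:    {diff:.2f}")
print(f"measure of uncertainty: {se_diff:.3f}")
print(f"95% CI: {low:.2f} to {high:.2f}")
```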

To put this in perspective, remember that the scores are averages over the 17 
symptoms. Therefore, a reduction from a mean of 0.60 to a mean of 0.43 would, for 
instance, correspond to a reduction from (0.6)(17) = 10 mild symptoms (rating of 1) 
to (0.43)(17) = 7.31, or just over 7 mild symptoms. In fact, examination of the full 
results shows that all 17 symptoms had reduced severity in the calcium-treated group 
compared with the placebo group. And, because this is a randomized experiment and 
not an observational study, we can conclude that the calcium actually caused the re- 
duction in symptoms. 


Table 21.2 Results for Case Study 21.1 

Symptom Complex Score: Mean ± SD 

               Placebo Group               Calcium-Treated Group 

Baseline       0.92 ± 0.55 (n = 235)       0.90 ± 0.52 (n = 231) 
Third cycle    0.60 ± 0.52 (n = 228)       0.43 ± 0.40 (n = 212) 


As a final note, Table 21.2 also indicates a striking drop in the mean symptom 
score from baseline to the third cycle for both groups. For the placebo group, the 
symptom scores dropped by about a third; for the calcium-treated group, they were 
more than cut in half. Thus, it appears that placebos can help reduce the severity of 
PMS symptoms. 


For Those Who Like Formulas 


Review of Notation from Previous Chapters 


Population mean = $\mu$, sample mean = $\bar{X}$, sample standard deviation = $s$ 

$z_{\alpha/2}$ = standardized normal score with area $\alpha/2$ above it 


Standard Error of the Mean 

Standard error of the mean = SEM = $s/\sqrt{n}$ 


Confidence Interval for a Single Population Mean $\mu$ 

$$\bar{X} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}$$


Notation for Two Populations and Samples 

Population mean = $\mu_i$, $i$ = 1 or 2 
Sample mean = $\bar{X}_i$, $i$ = 1 or 2 
Sample standard deviation = $s_i$, $i$ = 1 or 2 
Sample size = $n_i$, $i$ = 1 or 2 
Standard error of the mean = $\mathrm{SEM}_i = s_i/\sqrt{n_i}$, $i$ = 1 or 2 


Confidence Interval for the Difference in 
Two Population Means, Independent Samples 

$$(\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Exercises 

Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. In Chapter 20, we saw that to construct a confidence interval for a population 
proportion it was enough to know the sample proportion and the sample size. Is 
the same true for constructing a confidence interval for a population mean? That 
is, is it enough to know the sample mean and sample size? Explain. 


2. Explain the difference between a population mean and a sample mean using one 
of the studies discussed in the chapter as an example. 


13: 


4. 


*6. 


Ta 


*8. 


9. 
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The Baltimore Sun (Haney, 21 February 1995) reported on a study by Dr. Sara 
Harkness in which she compared the sleep patterns of 6-month-old infants in the 
United States and the Netherlands. She found that the 36 U.S. infants slept an 
average of just under 13 hours out of every 24, whereas the 66 Dutch infants 
slept an average of almost 15 hours. 


*a. The article did not report a standard deviation, but suppose it was 0.5 hour 
for each group. Compute the standard error of the mean (SEM) for the U.S. 
babies. 


b. Continuing to assume that the standard deviation is 0.5 hour, compute a 95% 
confidence interval for the mean sleep time for 6-month-old babies in the 
United States. 


c. Continuing to assume that the standard deviation for each group is 0.5 hour, 
compute a 95% confidence interval for the difference in average sleep time 
for 6-month-old Dutch and U.S. infants. 


What is the probability that a 95% confidence interval will not cover the true 
population value? 


5. Suppose a university wants to know the average income of its students who
work and all students supply that information when they register. Would the
university need to use the methods in this chapter to compute a confidence interval
for the population mean income? Explain. (Hint: What is the sample mean and 
what is the population mean?) 


*6. Suppose you were given a 95% confidence interval for the difference in two
population means. What could you conclude about the population means if 


*a. The confidence interval did not cover zero. 
*b. The confidence interval did cover zero. 


7. Suppose you were given a 95% confidence interval for the relative risk of
disease under two different conditions. What could you conclude about the risk of
disease under the two conditions if 


a. The confidence interval did not cover 1.0. 
b. The confidence interval did cover 1.0. 


*8. In Chapter 20, we learned that to compute a 90% confidence interval, we add
and subtract 1.645 rather than 2 times the measure of uncertainty. In this chapter,
we revisited Case Study 6.2 and found that a 95% confidence interval for the
difference in mean DHEA-S levels for 45- to 49-year-old women who meditated
compared with those who did not extended from -3.6 to 61.6.


*a. Compute a 90% confidence interval for the difference in mean DHEA-S
levels for this group. 


*b. Based on your result in part a, could you conclude with 90% confidence that 
the difference observed in the samples represents a real difference in the pop- 
ulations? Explain. 


9. In revisiting Case Study 6.2, we computed a confidence interval for the difference
in mean DHEA-S levels for 45- to 49-year-old women meditators and nonmeditators
and concluded that there probably was a real difference in the
population means because most of the interval was above zero. Now we will
compute results for the two groups separately and compare them.


a. Compute a 95% confidence interval for the mean DHEA-S level for 45- to 
49-year-old women in the comparison group. 


b. Compute a 95% confidence interval for the mean DHEA-S level for 45- to 
49-year-old women in the TM group. 


c. Based on the intervals computed in parts a and b, would you conclude that 
there is a difference in population means between the comparison and TM 
groups? Explain. 

10. In Case Study 6.4, which examined maternal smoking and child’s IQ, one of the
results reported in the journal article was the average number of days the infant
spent in the neonatal intensive care unit. The results showed an average of 0.35
day for infants of nonsmokers and an average of 0.58 day for the infants of
women who smoked 10 or more cigarettes per day. In other words, the infants
of smokers spent an average of 0.23 day more in neonatal intensive care. A 95%
confidence interval for the difference in the two means extended from -3.02
days to +2.57 days. Explain why it would have been misleading to report,
“these results show that the infants of smokers spend more time in neonatal
intensive care than do the infants of nonsmokers.”


*11. In a study comparing age of death for left- and right-handed baseball players,
Coren and Halpern (1991, p. 93) provided the following information: “Mean age 
of death for strong right-handers was 64.64 years (SD = 15.5, n = 1472); mean 
age of death for strong left-handers [was] 63.97 years (SD = 15.4, n = 236).” 
The term “strong handers” applies to baseball players who both threw and bat- 
ted with the same hand. The data were actually taken from entries in The Base- 
ball Encyclopedia (6th ed., New York: Macmillan, 1985), but, for the purposes 
of this exercise, pretend that the data were from a sample drawn from a larger 
population. 


*a. Compute a 95% confidence interval for the mean age of death for the popu- 
lation of strong right-handers from which this sample was drawn. 


b. Repeat part a for the strong left-handers. 


c. Compare the results from parts a and b in two ways. First, explain why one 
confidence interval is substantially wider than the other. Second, explain 
whether you would conclude that there is a difference in the mean ages of 
death for left- and right-handers on the basis of these results. 


d. Compute a 95% confidence interval for the difference in mean ages of death 
for the strong right- and left-handers. Interpret the result. 


12. In revisiting Case Study 5.3, we quoted the original journal article as reporting
that “for any vertex baldness (i.e., mild, moderate, and severe combined), the 
age-adjusted RR was 1.4 (95% CI, 1.2 to 1.9)” (Lesko et al., 1993, p. 1000). In- 
terpret this result. Explain in words that someone with no training in statistics 
would understand. 


13. In a report titled, “Secondhand Smoke: Is It a Hazard?” (Consumer Reports,
January 1995, pp. 27-33), 26 studies linking secondhand smoke and lung cancer
were summarized by noting, “those studies estimated that people breathing
secondhand smoke were 8 to 150 percent more likely to get lung cancer some- 
time later” (p. 28). Although it is not explicit, assume that the statement refers 
to a 95% confidence interval and interpret what this means. 


14. Refer to Case Study 21.1, illustrating the role of calcium in reducing the
symptoms of PMS. Using the caution given at the end of the section, explain 
why we cannot use the method presented in Section 21.2 to compare baseline 
symptom scores with third-cycle symptom scores for the calcium-treated 
group alone. 


15. Parts a-d below provide additional results for Case Study 21.1. For each of the
parts, compute a 95% confidence interval for the difference in mean symptom
scores between the placebo and calcium-treated conditions for the symptom
listed. In each case, the results given are mean ± standard deviation. There were
228 participants in the placebo group and 212 in the calcium-treated group.


a. Mood swings: placebo = 0.70 ± 0.75; calcium = 0.50 ± 0.58

b. Crying spells: placebo = 0.37 ± 0.57; calcium = 0.23 ± 0.40

c. Aches and pains: placebo = 0.49 ± 0.60; calcium = 0.31 ± 0.49

d. Craving sweets or salts: placebo = 0.60 ± 0.78; calcium = 0.43 ± 0.64


*16. A story in Newsweek (14 November 1994, pp. 52-54) reported the results of a
poll asking 756 American adults the question, “Do you think Clarence Thomas 
sexually harassed Anita Hill, as she charged three years ago?” The results of a 
similar poll 3 years earlier were also reported. The original poll was conducted 
in October 1991, during the time of congressional hearings in which Hill made 
allegations of sexual harassment against Supreme Court nominee Thomas. The 
proportion answering yes was 23% in 1991 and 40% in 1994. A 95% confidence 
interval for the difference in proportions answering yes in 1994 versus 1991 is 
17% ± 5%. Compute and interpret the interval. Does it indicate that opinion on
the issue among American adults had definitively changed during the 3 years be- 
tween polls? 


*17. Using the data presented by Hand and colleagues (1994) and discussed in
previous chapters, we would like to estimate the average age difference between
husbands and wives in Britain. Recall that the data consisted of a random sam- 
ple of 200 couples. Following are two methods that were used to construct a 
confidence interval for the difference in ages. Your job is to figure out which 
method is correct: 


Method 1: Take the difference between the husband’s age and the wife’s age for 
each couple, and use the differences to construct a 95% confidence interval for 
a single mean. The result was an interval from 1.6 to 2.9 years. 

Method 2: Use the method presented in this chapter for constructing a confi- 
dence interval for the difference in two means for two independent samples. The 
result was an interval from -0.4 to 4.3 years.

Explain which method is correct, and why. Then interpret the confidence inter- 
val that resulted from the correct method. 
18. Refer to Exercise 17. Suppose that from that same data set, we want to compute
the average difference between the heights of adult British men and adult British 
women—not the average difference within married couples. 


a. Which of the two methods in Exercise 17 would be appropriate for this 
situation? 


b. The 200 men in the sample had a mean height of 68.2 inches, with a stan- 
dard deviation of 2.7 inches. The 200 women had a mean height of 63.1 
inches, with a standard deviation of 2.5 inches. Assuming these were inde- 
pendent samples, compute a 95% confidence interval for the mean difference 
in heights between British males and females. Interpret the resulting interval 
in words that a statistically naive reader would understand. 


19. Refer to Case Study 21.1 and the material in Part 1 of this book.


a. In their original report, Thys-Jacobs and colleagues (1998) noted that the 
study was “double-blind.” Explain what that means in the context of this 
example. 


b. Explain why it is possible to conclude that, based on this study, calcium ac- 
tually causes a reduction in premenstrual symptoms. 


20. Refer to the following statement on page 396: “For example, as you can see
from the reported confidence intervals, we can’t rule out the possibility that the
differences in IQ at 1 and 2 years of age were in the other direction because the
interval covers some negative values.” The statement refers to a confidence
interval given in the previous paragraph, ranging from -3.03 to 8.20. Write a
paragraph explaining how to interpret this confidence interval that would be un- 
derstood by someone with no training in statistics. Make sure you are clear 
about the population to which the result applies. 


*21. Refer to Original Source 5 on the CD, “Distractions in Everyday Driving.” Table
14 on page 51 provides 95% confidence intervals for the average percent of time 
drivers in the population would be observed not to have their hands on the wheel 
during various activities while the vehicle was moving, assuming they were like 
the drivers in this study. The confidence intervals were computed using a differ- 
ent method than the one presented in this book because of the type of data avail- 
able, but the interpretation is the same. (See Appendix D of the report if you are 
interested in the details.) 


a. The confidence interval accompanying “Reading/writing” is from 4.24% to 
34.39%. Write a few sentences interpreting this interval that would be un- 
derstood by someone with no training in statistics. Make sure you are clear 
about the population to which the result applies. 


b. Repeat part a for the confidence interval for “Conversing.” 


*c. Refer to the interval in part a. Write it as a 95% confidence interval for the
average number of minutes out of an hour that drivers who were reading or 
writing would be observed to have their hands off of the wheel. 


d. Notice that in Table 14 some of the intervals are much wider than others. For 
instance, the 95% confidence interval for the percent of time with hands off
the wheel while not reading or writing is from 0.97 to 1.93, compared with
the interval from 4.24 to 34.39 while reading or writing. What two features 
of the data do you think are responsible for some intervals being wider than 
others? (Hint: What two features of the data determine the width of a 95% 
confidence interval for a mean? The answer is the same in this situation.) 


22. Refer to Original Source 9 on the CD, “Suicide Rates in Clinical Trials of
SSRIs, Other Antidepressants, and Placebo: Analysis of FDA Reports.” Table 1 
in that paper (page 791) provides confidence intervals for suicide rates for pa- 
tients taking three different kinds of medications. For example, the 95% confi- 
dence interval for “Placebo” is 0.01 to 0.19. This interval means that we can be 
95% confident that between 0.01% and 0.19% of the population of people sim- 
ilar to the ones in these studies would commit suicide while taking a placebo. 
Notice that 0.01% is about 1 out of 10,000 people and 0.19% is about 1 out of 
526 people, so this is a wide interval in terms of the actual suicide rate. Also note 
that the population from which the samples were drawn consisted of depressed 
patients who sought medical help. 


a. Report (from Table 1 in the article) the 95% confidence interval for the sui- 
cide rate for patients taking “selective serotonin reuptake inhibitors” 
(SSRIs). Write a sentence interpreting this interval. 


b. Compare the interval for patients taking placebos with the one of those tak- 
ing SSRIs. What can you conclude about the effectiveness of SSRIs in pre- 
venting suicide, based on a comparison between the two intervals? 


c. Read the article to determine whether the results are based on observational 
studies or randomized experiments. Based on that determination, can you 
conclude that differences that might have been observed in suicide rates for 
the different groups were caused by the difference in type of drug treatment? 
Explain. 


23. Refer to Original Source 10 on the CD, “Religious attendance and cause of
death over 31 years.” Table 2 in the article provides 95% confidence intervals
for the relative risk of death by various causes for those who attend religious 
services less than weekly versus weekly. (The relative risk is called “relative 
hazard” in the article and adjustments already have been made for some con- 
founding factors such as age.) 


a. Report the 95% confidence interval for “All Causes.” Explain what it means 
in a few sentences that would be understood by someone with no training in 
statistics. Make sure your explanation applies to the correct population. 


b. What value of relative risk would indicate equal risk for those who attend 
and do not attend religious services at least weekly? 


c. Confidence intervals are given for the relative risk of death for five specific 
causes of death (circulatory, cancer, digestive, respiratory and external). For 
which of these causes can it be concluded that the risk of death in the popu- 
lation is lower for those who attend religious services at least weekly? Ex- 
plain what criterion you need to determine your answer. (Hint: Refer to part 
b of this exercise.) 
d. Read the article to determine whether the results are based on an observational
study or a randomized experiment. Using that information, explain
whether it can be concluded that attending religious services at least weekly 
causes the risk of death to change, for causes of death for which the confi- 
dence interval indicates there is such a change. 

24. Refer to Original Source 11, “Driving impairment due to sleepiness is exacerbated
by low alcohol intake.” Table 1 on the top of page 691 presents mean
blood alcohol concentrations (BAC) and standard errors of the mean (SE in 
Table 1) for the participants before driving. The values are given for the “With 
sleep restriction” and “No sleep restriction” conditions. 


a. Compute 95% confidence intervals for the mean BAC before driving for 
each of the two conditions. (Note that the sample sizes are smaller than may 
be appropriate for using the method in this chapter, but that will cause the in- 
tervals to be just slightly too narrow, or alternatively, the confidence to be 
slightly less than 95%. You can ignore that detail.) 


b. Interpret the interval for the “With sleep restriction” condition. Make sure 
your explanation refers to the correct population. 


c. Ignoring the fact that the sample size is slightly too small, would it be ap- 
propriate to compute a confidence interval for the difference in the two 
means using the method provided in this chapter? If so, compute the inter- 
val. If not, explain why not. You may have to read the article to determine 
whether the appropriate condition is met. 


25. Table 1 on page 200 of Original Source 18, “Birth weight and cognitive func- 
tion in the British 1946 birth cohort: longitudinal population based study” (not 
available for the CD), provides 95% confidence intervals for the difference in 
mean standardized cognitive scores for midrange birthweight babies (3.01 to 
3.50 kg) and babies in each of four other birthweight groups. The comparisons 
are made at various ages. For scores at age 8, the intervals are -0.42 to -0.11
for low weight (0 to 2.50 kg at birth), -0.16 to +0.03 for low-normal weight
(2.51 to 3.00 kg at birth), 0.05 to 0.21 for high-normal weight (3.51 to 4.00 kg
at birth), and -0.08 to +0.14 for very high-normal weight (4.01 to 5.00 kg at
birth). 

a. Write a sentence or two interpreting the interval for the low-weight group. 
Make sure your explanation applies to the correct population. 


b. What difference value would indicate that the two means are equal in the 
population? 

c. For which of the birthweight groups is it clear that there is a difference in the 
population mean standardized cognitive scores for that group and the 
midrange group? What criterion did you use to decide? In each case, explain 
whether the difference indicates that the mean standardized cognitive score 
is higher or lower for that group than for the midrange birthweight group. 


d. Do the results of this study imply that low birth weight causes lower cogni- 
tive scores? Explain. 
Mini-Projects


1. Find a journal article that reports at least one 95% confidence interval. Explain 
what the study was trying to accomplish. Give the results as reported in the ar- 
ticle in terms of 95% confidence intervals. Interpret the results. Discuss whether 
you think the article accomplished its intended purpose. In your discussion, in- 
clude potential problems with the study, as discussed in Chapters 4 to 6. 


2. Collect data on a measurement variable for which the mean is of interest to you. 
Collect at least 30 observations. Using the data, compute a 95% confidence in- 
terval for the mean of the population from which you drew your observations. 
Explain how you collected your sample and note whether your method would 
be likely to result in any biases if you tried to extend your sample results to the 
population. Interpret the 95% confidence interval to make a conclusion about 
the mean of the population. 


3. Collect data on a measurement variable for which the difference in the means 
for two conditions or groups is of interest to you. Collect at least 30 observa- 
tions for each condition or group. Using the data, compute a 95% confidence in- 
terval for the difference in the means of the populations from which you drew 
your observations. Explain how you collected your samples and note whether 
your method would be likely to result in any biases if you tried to extend your 
sample results to the populations. Interpret the 95% confidence interval to make 
a conclusion about the difference in the means of the populations or conditions. 
Rejecting Chance: Testing
Hypotheses in Research


Thought Questions 


1. In the courtroom, juries must make a decision about the guilt or innocence of a de- 
fendant. Suppose you are on the jury in a murder trial. It is obviously a mistake if the 
jury claims the suspect is guilty when in fact he or she is innocent. What is the other 
type of mistake the jury could make? Which is more serious? 


2. Suppose exactly half, or 0.50, of a certain population would answer yes when asked 
if they support the death penalty. A random sample of 400 people results in 220, or 
0.55, who answer yes. The Rule for Sample Proportions tells us that the potential 
sample proportions in this situation are approximately bell-shaped, with standard de- 
viation of 0.025. Using the formula on page 156, find the standardized score for the 
observed value of 0.55. Then determine how often you would expect to see a stan- 
dardized score at least that large or larger. 


3. Suppose you are interested in testing a claim you have heard about the proportion 
of a population who have a certain trait. You collect data and discover that if the 
claim is true, the sample proportion you have observed is so large that it falls at the 
99th percentile of possible sample proportions for your sample size. Would you be- 
lieve the claim and conclude that you just happened to get a weird sample, or would 
you reject the claim? What if the result was at the 70th percentile? At the 99.99th 
percentile? 


4. Which is generally more serious when getting results of a medical diagnostic test: a 
false positive, which tells you you have the disease when you don't, or a false nega- 
tive, which tells you you do not have the disease when you do? 




22.1 Using Data to Make Decisions 


In Chapters 20 and 21, we computed confidence intervals based on sample data to
learn something about the population from which the sample had been taken. We
sometimes used those confidence intervals to make a decision about whether there
was a difference between two conditions.


Examining Confidence Intervals


When we examined the confidence interval for the relative risk of heart attacks for
men with vertex baldness compared with no baldness, we noticed that the interval
(1.2 to 1.9) was entirely above a relative risk of 1.0. Remember that a relative risk of
1.0 is equivalent to equal risk for both groups, whereas a relative risk above 1.0
means the risk is higher for the first group. In this example, if the confidence
interval had included 1.0, then we could not say whether the risk of heart attack is
higher for men with vertex baldness or for men with no hair loss, even with 95%
confidence.

Using another example from Chapter 21, we noticed that the confidence interval
for the difference in mean weight loss resulting from dieting alone versus exercising
alone was entirely above zero. From that, we concluded, with 95% confidence, that
the mean weight loss using dieting alone would be higher than it would be with
exercise alone. If the interval had covered (included) zero, we would not have been
able to say, with high confidence, which method resulted in greater average weight
loss in the population.


Hypothesis Tests


As we learned in Chapter 13, researchers interested in answering direct questions
often conduct hypothesis tests. Remember the basic question researchers ask when
they conduct such a test:


Is the relationship observed in the sample large enough to be called statistically
significant, or could it have been due to chance?


In this chapter, we learn more about the basic thinking that underlies hypothesis
testing. In the next chapter, we learn how to carry out some simple hypothesis tests
and examine some in-depth case studies.


EXAMPLE 1

Deciding if Students Prefer Quarters or Semesters


To illustrate the idea behind hypothesis testing, let's look at a simple, hypothetical ex- 
ample. There are two basic academic calendar systems in the United States: the quarter 
system and the semester system. Universities using the quarter system generally have 
three 10-week terms of classes, whereas those using the semester system have classes 
for two 15-week terms. 

Suppose a university is currently on the quarter system and is trying to decide 
whether to switch to the semester system. Administrators are leaning toward switching 
to the semester system, but they have heard that the majority of students may oppose 
the switch. They decide to conduct a survey to see if there is convincing evidence that a
majority of students opposes the plan, in which case they will reconsider their proposed 
change. 

Administrators must choose from two hypotheses: 


1. There is no clear preference (or the switch is preferred), so there is no problem. 


2. As rumored, a majority of students oppose the switch, so the administrators should 
stop their plan. 


The administrators pick a random sample of 400 students and ask their opinions. Of 
the 400, 220 of them say they oppose the switch. Thus, a clear majority of the sample, 
0.55 (55%), is opposed to the plan. Here is the question that would be answered by a 
hypothesis test: 


If there is really no clear preference, how likely would we be to observe sample 
results of this magnitude or larger, just by chance? 


We already have the tools to answer this question. From the Rule for Sample Pro- 
portions, we know what to expect if there is no clear preference—that is, if, in truth, 
50% of the students prefer each system: 


If numerous samples of 400 students are taken, the frequency curve for the 
proportions from the various samples will be approximately bell-shaped. The 
mean will be the true proportion from the population—in this case, 0.50. 
The standard deviation will be: 


$$\sqrt{\frac{(\text{true proportion}) \times (1 - \text{true proportion})}{\text{sample size}}}$$


In this case, $\sqrt{(0.5)(0.5)/400} = 0.025$.


In other words, if there is truly no preference, then the observed value of 0.55 must have 
come from a bell-shaped curve with a mean of 0.50 and a standard deviation of 0.025. 

How likely would a value as large as 0.55 be from that particular bell-shaped curve? 
To answer that, we need to compute the standardized score corresponding to 0.55: 


standardized score = z-score = (0.55 - 0.50)/0.025 = 2.00


From Table 8.1 in Chapter 8, we find that a standardized score of 2.00 falls between 
1.96 and 2.05, the values for the 97.5th and 98th percentiles. That is, if there is truly no 
preference, then we would observe a sample proportion as high as this (or higher) be- 
tween 2% and 2.5% of the time. (Using Excel or other software provides a more pre- 
cise answer of 2.3%.) 
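
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the standard library's `NormalDist` stands in for Table 8.1 or Excel.

```python
from math import sqrt
from statistics import NormalDist

null_proportion = 0.50        # "no clear preference"
sample_size = 400
observed = 220 / sample_size  # sample proportion opposed, 0.55

# Rule for Sample Proportions: standard deviation of possible sample proportions
sd = sqrt(null_proportion * (1 - null_proportion) / sample_size)

# Standardized score for the observed sample proportion
z = (observed - null_proportion) / sd

# Chance of a sample proportion this high or higher if there is no preference
p_value = 1 - NormalDist().cdf(z)

print(f"sd = {sd:.3f}, z = {z:.2f}, p = {p_value:.3f}")
# prints: sd = 0.025, z = 2.00, p = 0.023
```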

The administration must now make a decision. One of two things has happened: 


1. There really is no clear preference, but by the “luck of the draw” this particular sam- 
ple resulted in an unusually high proportion opposed to the switch. In fact, it is so 
high that chance would lead to such a high value only slightly more than 2% of the 
time. 


2. There really is a preference against switching to the semester system. The proportion 
(of all students) against the switch is actually higher than 0.50. 
Most researchers agree that, by convention, we can rule out chance if the “luck of
the draw” would have produced such extreme results less than 5% of the time. There- 
fore, in this case, the administrators should probably decide to rule out chance. The 
proper conclusion is that, indeed, a majority is opposed to switching to the semester 
system. 

When a relationship or value from a sample is so strong that we can effectively rule 
out chance in this way, we say that the result is statistically significant. In this case, we 
would say that the percent of students who are opposed to switching to the semester 
system is statistically significantly higher than 50%.


22.2 The Basic Steps for Testing Hypotheses 


Although the specific computational steps required to test hypotheses depend on the
type of variables measured, the basic steps are the same for all hypothesis tests. In
this section, we review the basic steps, first presented in Chapter 13.


The basic steps for testing hypotheses are


1. Determine the null hypothesis and the alternative hypothesis.


2. Collect the data and summarize them with a single number called a test
statistic.


3. Determine how unlikely the test statistic would be if the null hypothesis
were true.


4. Make a decision.


Step 1. Determine the null hypothesis and the alternative hypothesis. There are
always two hypotheses. The first one is called the null hypothesis, the hypothesis
that says nothing is happening. The exact interpretation varies, but it can generally
be thought of as the status quo, no relationship, chance only, or some variation on
that theme.

The second hypothesis is called the alternative hypothesis or the research
hypothesis. This hypothesis is usually the reason the data are being collected in the
first place. The researcher suspects that the status quo belief is incorrect or that there
is indeed a relationship between two variables that has not been established before.
The only way to conclude that the alternative hypothesis is the likely one is to have
enough evidence to effectively rule out chance, as presented in the null hypothesis.


EXAMPLE 2

A Jury Trial

If you are on a jury in the American judicial system, you must presume that the defen- 
dant is innocent unless there is enough evidence to conclude that he or she is guilty. 
Therefore, the two hypotheses are 
Null hypothesis: The defendant is innocent.
Alternative hypothesis: The defendant is guilty. 


The trial is being held because the prosecution believes the status quo assumption 
of innocence is incorrect. The prosecution collects evidence, much like researchers
collect data, in the hope that the jurors will be convinced that such evidence would be
extremely unlikely if the assumption of innocence were true.


For Example 1, the two hypotheses were 


Null hypothesis: There is no clear preference for quarters over semesters. 


Alternative hypothesis: The majority opposes the switch to semesters. 


The administrators are collecting data because they are concerned that the null hy- 
pothesis is incorrect. If, in fact, the results are extreme enough (many more than 50% 
of the sample oppose the switch), then the administrators must conclude that a ma- 
jority of the population of students opposes semesters. 


Step 2. Collect the data and summarize them with a single number called a test sta- 
tistic. Recall how we summarized the data in Example 1, when we tried to determine 
whether a clear majority of students opposed the semester system. In the final analy- 
sis, we based our decision on only one number, the standardized score for our sam- 
ple proportion. 

In general, the decision in a hypothesis test is based on a single summary of the 
data. This summary is called the test statistic. We encountered this idea in Chap- 
ter 13, where we used the chi-square statistic to determine whether the relationship 
between two categorical variables was statistically significant. In that kind of prob- 
lem, the chi-square statistic is the only summary of the data needed to make the de- 
cision. The chi-square statistic is the test statistic in that situation. The standardized 
score was the test statistic for Example 1 in this chapter.


Step 3. Determine how unlikely the test statistic would be if the null hypothesis were 
true. In order to decide whether the results could be just due to chance, we ask the 
following question: 


If the null hypothesis is really true, how likely would we be to observe sample 
results of this magnitude or larger (in a direction supporting the alternative hy- 
pothesis) just by chance? 


Answering that question usually requires special tables, such as Table 8.1 for stan- 
dardized scores, a computer with Excel or other software, or a calculator with sta- 
tistical functions. Fortunately, most researchers answer the question for you in their 
reports and you must simply learn how to interpret the answer. The numerical value 
giving the answer to the question is called the p-value. 


The p-value is computed by assuming the null hypothesis is true, and then 
asking how likely we would be to observe such extreme results (or even more 
extreme results) under that assumption. 
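
One way to make this definition concrete is by simulation, sketched here for Example 1 under the assumption that the null hypothesis (a true proportion of 0.50) holds. The function name, the number of trials, and the seed are arbitrary choices and are not part of the text. Because the simulation counts whole students rather than using the bell-shaped approximation, its answer will typically be close to, but not exactly, the 2.3% found in Example 1.

```python
import random


def simulated_p_value(null_proportion, sample_size, observed_count,
                      trials=5000, seed=1):
    """Estimate how often chance alone gives a result at least as extreme
    as the one observed, assuming the null hypothesis is true."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(trials):
        # Draw one random sample in which each person opposes the switch
        # with probability equal to the null proportion.
        count = sum(rng.random() < null_proportion for _ in range(sample_size))
        if count >= observed_count:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials


# Example 1: 220 of 400 students opposed; the null hypothesis says 50% oppose.
p = simulated_p_value(0.50, 400, 220)
print(f"Simulated p-value: {p:.3f}")
```

Note that the simulation never asks how likely the null hypothesis is. It only asks how unusual the data would be if the null hypothesis were true.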

Many statistical novices misinterpret the meaning of a p-value. The p-value does 
not give the probability that the null hypothesis is true. There is no way to do that. 
For example, in Case Study 1.2, we noticed that for the men in that sample, those 
who took aspirin had fewer heart attacks than those who took a placebo. The p-value 
for that example tells us the probability of observing a relationship that extreme or 
more so in a sample of that size if there really is no difference in heart attack rates 
for the two conditions in the population. There is no way to determine the probabil- 
ity that aspirin actually has no effect on heart attack rates. In other words, there is no 
way to determine the probability that the null hypothesis is true. Don’t fall into the 
trap of believing that common misinterpretation of the p-value. 
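
To make the definition of the p-value concrete, here is a minimal simulation sketch in Python. It uses the numbers from Example 1 (220 of 400 sampled students opposed the switch), assumes the null hypothesis is true (each sampled student opposes with probability 0.50), and counts how often a simulated sample is at least as extreme as the one observed. The function name, the number of simulated samples, and the random seed are illustrative assumptions, not part of the original example.

```python
import random

def simulated_p_value(observed_count, sample_size, null_value,
                      num_samples=2000, seed=1):
    """Estimate a one-sided p-value by simulating samples under the null.

    Returns the fraction of simulated samples whose count is at least as
    large as the observed count.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    at_least_as_extreme = 0
    for _ in range(num_samples):
        # One simulated sample, generated as if the null hypothesis were true.
        count = sum(rng.random() < null_value for _ in range(sample_size))
        if count >= observed_count:
            at_least_as_extreme += 1
    return at_least_as_extreme / num_samples

# Example 1: 220 of 400 students opposed; the null value is 0.50.
estimate = simulated_p_value(observed_count=220, sample_size=400, null_value=0.50)
print(f"Estimated p-value: {estimate:.3f}")
```

The estimate describes how unusual the data would be if the null hypothesis were true. It says nothing about the probability that the null hypothesis itself is true.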


Step 4. Make a decision. Once we know how unlikely the results would have been 
if the null hypothesis were true, we must make one of two choices: 


Choice 1: The p-value is not small enough to convincingly rule out chance. 
Therefore, we cannot reject the null hypothesis as an explanation for the 
results. 


Choice 2: The p-value is small enough to convincingly rule out chance. We re- 
ject the null hypothesis and accept the alternative hypothesis. 


Notice that it is not valid to actually accept that the null hypothesis is true. To do so 
would be to say that we are essentially convinced that chance alone produced the ob- 
served results. This is another common mistake, which we will explore in Chap- 
ter 24. 

Making choice 2 is equivalent to declaring that the result is statistically signifi- 
cant. We can rephrase the two choices in terms of statistical significance as follows: 


Choice 1: There is no statistically significant difference or relationship evi- 
denced by the data. 


Choice 2: There is a statistically significant difference or relationship evi- 
denced by the data. 


You may be wondering how small the p-value must be in order to be small enough 
to rule out the null hypothesis. The cutoff point is called the level of significance. 
The standard used by most researchers is 5%. However, that is simply a convention 
that has become accepted over the years, and there are situations for which that value 
may not be wise. (We explore this issue further in Section 22.4.) 

Let’s return to the analogy of the jury in a courtroom. In that situation, the in- 
formation provided is generally not summarized into a single number. However, the 
two choices are equivalent to those in hypothesis testing: 


Choice 1: The evidence is not strong enough to convincingly rule out that the 
defendant is innocent. Therefore, we cannot reject the null hypothesis, or inno- 
cence of the defendant, based on the evidence presented. We say that the defen- 
dant is not guilty. Notice that we do not conclude that the defendant is 
innocent, which would be akin to accepting the null hypothesis. 


Choice 2: The evidence was strong enough that we are willing to rule out the 
possibility that an innocent person (as stated in the null hypothesis) produced 
the observed data. We reject the null hypothesis, that the defendant is innocent, 
and assert the alternative hypothesis, that he or she is guilty. 


Consistent with our thinking in hypothesis testing, for Choice 1 we would not accept 
the hypothesis that the defendant is innocent. We would simply conclude that the ev- 
idence was not strong enough to rule out the possibility of innocence and conclude 
that the defendant was not guilty. 


22.3 Testing Hypotheses for Proportions 


Let’s illustrate the four steps of hypothesis testing for the situation where we are test- 
ing whether a population proportion is equal to a specific value, as in Example 1. In 
that example we outlined the steps informally, and in this section we will be explicit 
about them. We will use the following example as we review the four steps of hy- 
pothesis testing in the context of one proportion. 


EXAMPLE 3 
Family Structure in the Teen Drug Survey 

According to data from the U.S. government 
(http://www.census.gov/population/www/socdemo/hh-fam/cps2002.html), 67% of children 
aged 12 to 17 in the United States in 2001 were living with two parents, either 
biological or stepparents. News Story 
13 and Original Source 13 on the CD describe a survey of teens and their parents using 
a random sample of teens in the United States, aged 12 to 17. In Original Source 13, 
following question 8 on page 38, a summary is given showing that 84% of the teens in 
the survey were living with two parents. Does this mean that the population represented 
by the teens who were willing to participate in the survey has a higher proportion living 
with both parents than does the general population in that age group? Or, could the 
84% found in the sample for this survey simply represent chance variation based on the 
fact that there were only 1987 teens in the study? We will test the null hypothesis that 
the population proportion in this case is 0.67 (67%) versus the alternative hypothesis 
that it is not, where the population is all teens who would be willing and able to par- 
ticipate in this survey if they had been asked. In other words, we are testing whether the 
population represented by the survey is the same as the general population in terms of 
family structure. 


Step 1. Determine the null hypothesis and the alternative hypothesis. Researchers 
are often interested in testing whether a population proportion is equal to a specific 
value, which we will call the null value. In all cases, the null hypothesis is that the 
population proportion is equal to that value. However, the alternative hypothesis de- 
pends on whether the researchers have a preconceived idea about which direction the 
difference will take, if there is one. It is definitely not legitimate to look at the data 
first and then decide! But, if the original hypothesis of interest is in only one direc- 
tion, then it is legitimate to include values only in that direction as part of the alter- 
native hypothesis. If values above the null value only or below the null value only 
are included in the alternative hypothesis, the test is called a one-sided test or a one- 
tailed test. If values on either side of the null value are included in the alternative 
hypothesis, the test is called a two-sided test or a two-tailed test. 

To write the hypothesis, researchers must first specify what population they are 
measuring and what proportion is of interest. For instance, in Example 1 in this 
chapter, the population of interest was all students at the university and the propor- 
tion of interest was the proportion of them that oppose switching to the semester sys- 
tem. In Example 3, the population of interest was all teens who would have been 
willing and able (parental permission was required) to participate in the survey if 
they had been asked. The proportion of interest was the proportion of them living 
with both parents. 

Once the population has been specified, the null value is determined by the re- 
search question of interest. Then the null hypothesis is written as follows: 


Null hypothesis: The population proportion of interest equals the null value. 

The alternative hypothesis is one of the following, depending on the research question: 


Alternative hypothesis: The population proportion of interest does not equal the 
null value. (This is a two-sided hypothesis.) 


Alternative hypothesis: The population proportion of interest is greater than the 
null value. (This is a one-sided hypothesis.) 


Alternative hypothesis: The population proportion of interest is less than the 
null value. (This is a one-sided hypothesis.) 


EXAMPLE 1 CONTINUED 
Switching to Semesters 


For Example 1, the administration was only concerned with discovering the situation 
where a majority of students was opposed to switching. So, the test was one-sided with 
hypotheses: 


Null hypothesis: The proportion of students at the university who oppose switch- 
ing to semesters is 0.50. 


Alternative hypothesis: The proportion of students at the university who oppose 
switching to semesters is greater than 0.50. 


EXAMPLE 3 CONTINUED 
Teen Drug Survey 


For Example 3, the researchers should be concerned about whether their sample is rep- 
resentative of all teens in terms of family structure. Based on data from the U.S. gov- 
ernment, it was known that at the time about 67% of teens lived with both parents. 
Therefore, the appropriate test would examine whether the population of teens repre- 
sented by the sample had 67% or something different than 67% of the teens living with 
both parents. (It would be dishonest to look at the data first, then write the hypothesis 
to fit it.) Thus, in this case the alternative hypothesis should be two-sided. The appro- 
priate hypotheses are: 


Null hypothesis: For the population of teens represented by the survey, the propor- 
tion living with both parents is 0.67. 

Alternative hypothesis: For the population of teens represented by the survey, the 
proportion living with both parents is not equal to 0.67. 


Step 2. Collect the data and summarize them with a single number called a test sta- 
tistic. The key when testing whether a population proportion differs from the null 
value based on a sample is to compare the null value to the corresponding proportion 
in the sample, simply called the sample proportion. For instance, in Example 1 
the sample proportion (opposing the switch to semesters) was 220/400 or 0.55 
and we want to know how far off that is from the null value of 0.50. In Example 3 
the sample proportion (living with both parents) was reported as 0.84 and we want 
to know how far off that is from the null value of 0.67. In general, we want to know 
how far the sample value is from the null value. If we know that, then we can find 
out how unlikely the sample value would be if the null value were the actual 
population proportion. 

It is, of course, easy to find the actual difference between the sample value and 
the null value, but that doesn’t help us because we don’t know how to assess whether 
the difference is large enough to indicate a real population difference or not. To make 
that assessment, we need to know how many standard deviations apart the two values 
are. Therefore, the test statistic when testing a single proportion is the standardized 
score measuring the distance between the sample proportion and the null value: 


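\[
\text{test statistic} = \text{standardized score} = z\text{-score} = \frac{\text{sample proportion} - \text{null value}}{\text{standard deviation}}
\]

where 

\[
\text{standard deviation} = \sqrt{\frac{(\text{null value}) \times (1 - \text{null value})}{\text{sample size}}}
\]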


EXAMPLE 1 CONTINUED 
Switching to Semesters 

For Example 1, the null value is 0.50, the sample proportion is 0.55, and the sample size 
is 400. Therefore, the standard deviation is the square root of [(0.5)(0.5)/400] = 0.025. 
The test statistic, computed in Example 1, is (0.55 - 0.50)/0.025 = 2.00. 


EXAMPLE 3 CONTINUED 
Teen Drug Survey 

For Example 3, the null value is 0.67, the sample proportion is 0.84, and the sample size 
is 1987. Therefore, the standard deviation is the square root of [(0.67)(1 - 0.67)/1987] = 
0.0105. The test statistic is (0.84 - 0.67)/0.0105 = 0.17/0.0105, or about 16, which is 
extremely large for a standardized score. It is virtually impossible that 84% of the sample 
would be living with both parents if in fact only 67% of the population were doing so. 
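
The test statistic for one proportion can be sketched in a few lines of Python. This is a minimal illustration of the formula above, applied to the numbers quoted for Examples 1 and 3; the function and variable names are assumptions chosen for readability.

```python
import math

def z_test_statistic(sample_proportion, null_value, sample_size):
    """Standardized score for a test of one proportion.

    The standard deviation is computed from the null value, because the
    test statistic is evaluated as if the null hypothesis were true.
    """
    standard_deviation = math.sqrt(null_value * (1 - null_value) / sample_size)
    return (sample_proportion - null_value) / standard_deviation

z_semesters = z_test_statistic(0.55, 0.50, 400)   # Example 1
z_teens = z_test_statistic(0.84, 0.67, 1987)      # Example 3
print(f"Example 1 z-score: {z_semesters:.2f}")
print(f"Example 3 z-score: {z_teens:.1f}")
```

Because the code does not round the standard deviation before dividing, the Example 3 value may differ slightly from a hand calculation, but it remains about 16.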


Step 3. Determine how unlikely the test statistic would be if the null hypothesis were 
true. In this step, we compute the p-value. We find the probability of observing a 
standardized score as far from the null value or more so, in the direction specified in 
the alternative hypothesis, if the null hypothesis is true. To find this value, use a 
table, calculator or computer package. The correct probability depends on whether 
the test is a one-sided or two-sided test. The method is as follows: 


Alternative Hypothesis:                   p-value = Proportion of Bell-Shaped Curve: 

Proportion is greater than null value.    Above the z-score test statistic value. 
Proportion is less than null value.       Below the z-score test statistic value. 
Proportion is not equal to null value.    [Above absolute value of test statistic] × 2. 


EXAMPLE 1 CONTINUED 
Switching to Semesters 

For Example 1, the alternative hypothesis was one-sided, that the proportion opposed 
to switching to semesters is greater than the null value of 0.50. The z-score test statistic 
value was computed to be 2.00. Therefore, the p-value is the proportion of the bell-shaped 
curve above 2.00. From Table 8.1, we see that the proportion below 2.00 is 
between 0.975 and 0.98. Therefore, the proportion above 2.00 is between 0.025 
(that is, 1 - 0.975) and 0.02 (that is, 1 - 0.98). We can find a more precise proportion 
below 2.00 by using the Excel statement NORMSDIST(2.0); the result is 0.9772. Thus, 
to four decimal places, the p-value is (1 - 0.9772) = 0.0228. 


EXAMPLE 3 CONTINUED 
Teen Drug Survey 

For Example 3, the alternative hypothesis was two-sided, that the proportion of the 
population living with two parents was not equal to 0.67. The z-score test statistic was 
computed to be 16. Therefore the p-value is two times the proportion of the bell-shaped 
curve above 16. The proportion above 16 is essentially 0. In fact, from Table 8.1 you can 
see that the proportion above 6.0 is already essentially 0. So, in this case, the p-value is 
essentially 0. It is almost impossible to observe a sample of 1987 teens in which 84% are 
living with both parents if in fact only 67% of the population are doing so. Clearly, we 
will reject the null hypothesis that the true population proportion in this case is 0.67. 
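
If no table or spreadsheet is at hand, the three p-value rules in Step 3 can be sketched in Python using only the standard library. The area above a standardized score comes from the complementary error function `math.erfc`; the labels "greater", "less", and "not equal" are illustrative names for the three kinds of alternative hypothesis, not terms from the text.

```python
import math

def upper_tail(z):
    """Proportion of the standard bell-shaped (normal) curve above z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def p_value(z, alternative):
    """p-value for a z-score test statistic.

    alternative is one of "greater", "less", or "not equal".
    """
    if alternative == "greater":
        return upper_tail(z)
    if alternative == "less":
        return upper_tail(-z)   # the area below z equals the area above -z
    if alternative == "not equal":
        return 2 * upper_tail(abs(z))
    raise ValueError("alternative must be 'greater', 'less', or 'not equal'")

print(f"Example 1 (one-sided, z = 2.00): {p_value(2.00, 'greater'):.4f}")
print(f"Example 3 (two-sided, z = 16):   {p_value(16, 'not equal'):.1e}")
```

The first line reproduces the 0.0228 found with Excel, and the second shows just how small "essentially 0" is.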


Step 4. Make a decision. This step is the same for any situation, but the wording 
may differ slightly based on the context. The researcher specifies the level of signif- 
icance and then determines whether or not the p-value is that small or smaller. If so, 
the result is statistically significant. The most common level of significance is 0.05. 
There are multiple ways in which the decision can be worded. Here are some generic 
versions, as well as how it would be worded in the case of testing one proportion. 

If the p-value is greater than the level of significance, 

• Do not reject the null hypothesis. 
• The true population proportion is not significantly different from the null 
  value. 

If the p-value is less than or equal to the level of significance, 

• Reject the null hypothesis. 
• Accept the alternative hypothesis. 
• The true population proportion is significantly different from the null value. 


For the last version, if the test is one-sided, the decision would be worded accord- 
ingly. 


EXAMPLE 1 CONTINUED 
Switching to Semesters 

In Example 1 a one-sided test was used, with the alternative hypothesis specifying that 
more than 50% opposed switching to semesters. The p-value for the test is 0.023. 
Therefore, using a level of significance of 0.05, the p-value is less than the level of 
significance. The conclusion can be worded in these ways: 

• Reject the null hypothesis. 
• Accept the alternative hypothesis. 
• The true population proportion opposing the switch to semesters is significantly 
  greater than 0.50. 

Therefore, the administration should think twice about making the switch. 


EXAMPLE 3 CONTINUED 
Teen Drug Survey 

For Example 3 the alternative hypothesis was two-sided. The p-value was essentially 0, 
so no matter what level of significance is used, within reason, the p-value is less. The 
conclusion can be worded in these ways: 

• Reject the null hypothesis. 
• Accept the alternative hypothesis. 
• The true proportion of teens living with both parents in the population represented by 
  the survey is significantly different from 0.67. 
• The proportion of teens living with both parents in the population represented by the 
  survey is significantly different from the proportion of teens living with both parents 
  in the United States. 

Notice that the conclusion doesn’t state the direction of the difference. A confidence 
interval could be found to estimate the proportion of teens living with both parents for 
the population represented by the survey, and that would provide the direction of the 
difference. 
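
Step 4 itself is just a comparison, which the following Python sketch makes explicit. The function name and the wording of the returned messages are illustrative assumptions. The second call uses a stricter level of significance of 0.01, chosen here only to show that the same p-value can lead to a different decision when the convention changes.

```python
def decide(p_value, level_of_significance=0.05):
    """Return the Step 4 decision for a given p-value."""
    if p_value <= level_of_significance:
        return "Reject the null hypothesis (statistically significant)."
    return "Do not reject the null hypothesis (not statistically significant)."

print(decide(0.023))                              # Example 1 at the usual 0.05 level
print(decide(0.023, level_of_significance=0.01))  # same p-value, stricter level
```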


22.4 What Can Go Wrong: The Two Types of Errors 


Any time decisions are made in the face of uncertainty, mistakes can be made. In 
testing hypotheses, there are two potential decisions or choices and each one brings 
with it the possibility that a mistake, or error, has been made. 


The Courtroom Analogy 


It is important to consider the seriousness of the two possible errors before making 
a choice. Let’s use the courtroom analogy as an illustration. Here are the potential 
choices and the error that could accompany each: 


Choice 1: We cannot rule out that the defendant is innocent, so he or she is set 
free without penalty. 


Potential error: A criminal has been erroneously freed. 

Choice 2: We believe there is enough evidence to conclude that the defendant 
is guilty. 

Potential error: An innocent person is falsely convicted and penalized and the 
guilty party remains free. 


Although the seriousness of the two potential errors depends on the seriousness of 
the crime and the punishment, the error that can accompany choice 2 is usually seen 
as more serious. Not only is an innocent person punished but a guilty one remains 
completely free and the case is closed. 

A Medical Analogy: False Positive versus False Negative 


As another example, consider a medical scenario in which you are tested for a dis- 
ease. Most tests for diseases are not 100% accurate. In reading your results, the lab 
technician or physician must make a choice between two hypotheses: 


Null hypothesis: You do not have the disease. 


Alternative hypothesis: You have the disease. 

Notice the possible errors engendered by each of these decisions: 


Choice 1: In the opinion of the medical practitioner, you are healthy. The test 
result was weak enough to be called “negative” for the disease. 


Potential error: You are actually diseased but have been told you are not. In 
other words, your test was a false negative. 


Choice 2: In the opinion of the medical practitioner, you are diseased. The test 
results were strong enough to be called “positive” for the disease. 


Potential error: You are actually healthy but have been told you are diseased. 
In other words, your test was a false positive. 


Which error is more serious in medical testing? It depends on the disease and on the 
consequences of a negative or positive test result. For instance, a false negative in a 
screening test for cancer could lead to a fatal delay in treatment, whereas a false pos- 
itive would probably only lead to a retest and short-term concern. 

A more troublesome example occurs in testing for HIV, where it is very serious 
to report a false negative and tell someone they are not infected when in truth they 
are infected. However, it is quite frightening for the patient to be given a false positive 
test for HIV; that is, the patient is really healthy but is told otherwise. HIV testing 
tends to err on the side of false positives. However, before being given 
the results, people who test positive for HIV with an inexpensive screening test are 
generally retested using an extremely accurate but more expensive test. Unfortunately, 
those who test negative are generally not retested, so it is important for the 
initial test to have a very low false negative rate. 


The Two Types of Error in Hypothesis Testing 


You can see that there is a trade-off between the two types of errors just discussed. 
Being too lenient in making decisions in one direction or the other is not wise. De- 
termining the best direction in which to err depends on the situation and the conse- 
quences of each type of potential error. 

The courtroom and medical analogies are not much different from the scenario 
encountered in hypothesis testing in general. Two types of errors can be made. A 
type 1 error can only be made if the null hypothesis is actually true. A type 2 error 
can only be made if the alternative hypothesis is actually true. Figure 22.1 illustrates 
the errors that can happen in the courtroom, in medical testing, and in hypothesis 
testing. 


Figure 22.1 Potential errors in the courtroom, in medical testing, and in hypothesis testing 

                                  True State of Nature 

                                  Innocent, Healthy,         Guilty, Diseased, 
Decision Made                     Null Hypothesis            Alternative Hypothesis 

Not guilty,                       Correct                    Undeserved freedom, 
Healthy,                                                     False negative, 
Don't reject null hypothesis                                 Type 2 error 

Guilty,                           Undeserved punishment,     Correct 
Diseased,                         False positive, 
Accept alternative hypothesis     Type 1 error 

EXAMPLE 1 CONTINUED 

In our example of switching from quarters to semesters, if the university administrators 
were to make a type 1 error, the decision would correspond to a false positive. The 
administrators would be creating a false alarm, in which they stopped their planned 
change for no good reason. A type 2 error would correspond to a false negative, in 
which the administrators would have decided there isn’t a problem when there really is 
one. 


Probabilities Associated with Type 1 and Type 2 Errors 


It would be nice if we could specify the probability that we were indeed making an 
error with each potential decision. We could then weigh the consequence of the er- 
ror against its probability. Unfortunately, in most cases, we can only specify the con- 
ditional probability of making a type 1 error, given that the null hypothesis is true. 
That probability is the level of significance, usually set at 0.05. 


Level of Significance and Type 1 Errors 

Remember that two conditions have to hold for a type 1 error to be made. First, it 
can only happen when the null hypothesis is true. Second, the data must convince us 
that the alternative hypothesis is true. For this to happen, the p-value must be less 
than or equal to the level of significance, usually set at 0.05. What is the probability 
that the p-value will be 0.05 or less by chance? If the null hypothesis is true, that 
probability is in fact 0.05. No matter what level of significance is chosen, the prob- 
ability that the p-value will be that small or smaller is equal to the level of signifi- 
cance (when the null hypothesis is true). Let’s restate this information about the 
probability of making a type 1 error: 


If the null hypothesis is true, the probability of making a type 1 error is equal 
to the stated level of significance, usually 0.05. If the null hypothesis is not 
true, a type 1 error cannot be made. 
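
The claim that the type 1 error probability equals the level of significance can be checked with a small simulation. The sketch below, under stated assumptions, repeats the semester survey 2,000 times with the null hypothesis true (exactly half of all students opposed) and records how often the one-sided test rejects anyway. The number of simulated studies, the seed, and the function names are illustrative choices.

```python
import math
import random

def rejects_null(count, sample_size, null_value, level_of_significance):
    """One-sided test of one proportion: True if the null is rejected."""
    sd = math.sqrt(null_value * (1 - null_value) / sample_size)
    z = (count / sample_size - null_value) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # area above z
    return p_value <= level_of_significance

def type_1_error_rate(sample_size=400, null_value=0.50,
                      level_of_significance=0.05, num_studies=2000, seed=7):
    """Fraction of simulated studies that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(num_studies):
        # The data are generated with the null value as the true proportion.
        count = sum(rng.random() < null_value for _ in range(sample_size))
        if rejects_null(count, sample_size, null_value, level_of_significance):
            rejections += 1
    return rejections / num_studies

rate = type_1_error_rate()
print(f"Simulated type 1 error rate: {rate:.3f}")
```

Because sample counts are whole numbers and the simulation is finite, the printed rate should be close to 0.05 rather than exactly equal to it.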

Type 2 Errors 

Notice that making a type 2 error is possible only if the alternative hypothesis is true. 
A type 2 error is made if the alternative hypothesis is true, but you fail to choose it. 
The probability of doing that depends on exactly which part of the alternative hypothesis 
is true, so computing a single probability of making a type 2 error is not 
feasible. 

For instance, university administrators in our hypothetical example did not spec- 
ify what percentage of students would have to oppose switching to semesters in or- 
der for the alternative hypothesis to hold. They merely specified that it would be over 
half. If only 51% of the population of students oppose the switch, then a sample of 
400 students could easily result in fewer than half of the sample in opposition, in 
which case the null hypothesis would not be rejected even though it is false. How- 
ever, if 80% of the students in the population oppose the switch, the sample propor- 
tion would almost surely be large enough to convince the administration that the 
alternative hypothesis holds. In the first case (51% in opposition), it would be very 
easy to make a type 2 error, whereas in the second case (80% in opposition), it would 
be almost impossible. Yet, both are legitimate cases where the alternative hypothesis 
is true. That’s why we can’t specify one single value for the probability of making a 
type 2 error. 
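
Although no single type 2 error probability exists, one can be computed for any particular true proportion. The following Python sketch does this exactly for the two cases just described (51% and 80% in opposition), assuming a sample of 400, a one-sided test, and a 0.05 level of significance. It adds up the binomial probabilities of every sample count that would fail to reject the null hypothesis; the function name is an illustrative assumption.

```python
import math

def type_2_error_probability(true_proportion, sample_size=400,
                             null_value=0.50, level_of_significance=0.05):
    """Chance that a one-sided test fails to reject a false null hypothesis.

    Sums the binomial probabilities, under the true proportion, of every
    sample count whose p-value is above the level of significance.
    """
    sd = math.sqrt(null_value * (1 - null_value) / sample_size)
    total = 0.0
    for count in range(sample_size + 1):
        z = (count / sample_size - null_value) / sd
        p_value = 0.5 * math.erfc(z / math.sqrt(2))
        if p_value > level_of_significance:   # the null would not be rejected
            total += (math.comb(sample_size, count)
                      * true_proportion ** count
                      * (1 - true_proportion) ** (sample_size - count))
    return total

for true_proportion in (0.51, 0.80):
    chance = type_2_error_probability(true_proportion)
    print(f"True proportion {true_proportion:.2f}: "
          f"probability of a type 2 error is about {chance:.3f}")
```

The two results sit at opposite extremes, which is the point of the passage: the answer depends entirely on which value within the alternative hypothesis happens to be true.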


The Power of a Test 

Researchers should never conclude that the null hypothesis is true just because the 
data failed to provide enough evidence to reject it. There is no control over the probability 
that an erroneous decision has been made in that situation. The power of a 
test is the probability of making the correct decision when the alternative hypothesis 
is true. It should be clear to you that the power of the administrators to detect that 
a majority opposes switching to semesters will be much higher if that majority constitutes 
80% of the population than if it constitutes just 51% of the population. In 
other words, if the population value falls close to the value specified as the null hypothesis, 
then it may be difficult to get enough evidence from the sample to conclusively 
choose the alternative hypothesis. There will be a relatively high probability of 
making a type 2 error, and the test will have relatively low power in that case. 

Sometimes news reports, especially in science magazines, will insightfully note 
that a study may have failed to find a relationship between two variables because the 
test had such low power. As we will see in Chapter 24, this is a common consequence 
of conducting research with samples that are too small, but it is one that is 
often overlooked in media reports. 
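
The link between sample size and power can be illustrated with a rough calculation. The sketch below uses a normal approximation for a one-sided test of one proportion, with a null value of 0.50 and a level of significance of 0.05. The true proportion of 0.55 and the three sample sizes are hypothetical values chosen only for illustration; they do not come from the text.

```python
import math
from statistics import NormalDist

def approximate_power(true_proportion, sample_size, null_value=0.50,
                      level_of_significance=0.05):
    """Normal-approximation power of a one-sided ("greater than") test."""
    standard_normal = NormalDist()
    # Smallest sample proportion that leads to rejecting the null hypothesis.
    z_cutoff = standard_normal.inv_cdf(1 - level_of_significance)
    null_sd = math.sqrt(null_value * (1 - null_value) / sample_size)
    cutoff = null_value + z_cutoff * null_sd
    # How the sample proportion actually varies when the alternative is true.
    true_sd = math.sqrt(true_proportion * (1 - true_proportion) / sample_size)
    return 1 - standard_normal.cdf((cutoff - true_proportion) / true_sd)

for n in (100, 400, 1600):
    print(f"Sample size {n}: power is about {approximate_power(0.55, n):.2f}")
```

With the true proportion held fixed, the power climbs as the sample grows, which is why a study that is too small can easily miss a real difference.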


When to Reject the Null Hypothesis 


When you read the results of research in a journal, you often are presented with a 
p-value, and the conclusion is left to you. In deciding whether a null hypothesis 
should be rejected, you should consider the consequences of the two potential types 
of errors. If you think the consequences of a type 1 error are very serious, then you 
should only reject the null hypothesis and accept the alternative hypothesis if the 
p-value is very small. Conversely, if you think a type 2 error is more serious, you 
should be willing to reject the null hypothesis with a moderately large p-value, typically 
0.05 to 0.10. 


CASE STUDY 22.1 
Testing for the Existence of Extrasensory Perception 


For centuries, people have reported experiences of knowing or communicating in- 
formation that cannot have been transmitted through normal sensory channels. There 
have also been numerous reports of people seeing specific events in visions and 
dreams that then came to pass. These phenomena are collectively termed extrasen- 
sory perception. 

In Chapter 18, we learned that it is easy to be fooled by underestimating the 
probability that weird events could have happened just by chance. Therefore, many 
of these reported episodes of extrasensory perception are probably explainable in 
terms of chance phenomena and coincidences. 

Scientists have been conducting experiments to test extrasensory perception in 
the laboratory for several decades. As with many experiments, those performed in the 
laboratory lack the “ecological validity” of the reported anecdotes, but they have the 
advantage that the results can be quantified and studied using statistics. 


Description of the Experiments 


In this case study, we focus on one branch of research that has been investigated for 
a number of years under very carefully controlled conditions. The experiments are 
reported in detail by Honorton and colleagues (1990) and have also been summa- 
rized by Utts (1991, 1996), Bem and Honorton (1994), and Bem et al. (2002). 

The experiments use an experimental setup called the ganzfeld procedure. This 
involves four individuals, two of whom are participants and two, researchers. One of 
the two participants is designated as the sender and the other as the receiver. One of 
the two researchers is designated as the experimenter and the other as the assistant. 

Each session of this experiment produces a single yes or no data value and takes 
over an hour to complete. The session begins by sequestering both the receiver and 
the sender in separate, sound-isolated, electrically shielded rooms. The receiver 
wears headphones over which “white noise” (which sounds like a continuous hiss- 
ing sound) is played. He or she is also looking into a red light, with halved Ping- 
Pong balls taped over the eyes to produce a uniform visual field. The term ganzfeld 
means “total field” and is derived from this visual experience. The reasoning behind 
this setup is that the senses will be open and expecting meaningful input, but noth- 
ing in the room will be providing such input. The mind may therefore look elsewhere 
for input. 

Meanwhile, in another room, the sender is looking at either a still picture (a 
“static target”) or a short video (a “dynamic target”) on a television screen and at- 
tempting to “send” the image (or images) to the receiver. Here is an example of a de- 
scription of a static target, from Honorton and colleagues (1990, p. 123): 


Flying Eagle. An eagle with outstretched wings is about to land on a perch; its 
claws are extended. The eagle’s head is white and its wings and body are black. 

The receiver has a microphone into which he or she is supposed to provide a continuous 
monologue about what images or thoughts are present. The receiver has no 
idea what kind of picture the sender might be watching. Here is part of what the re- 
ceiver said during the monologue for the session in which the “Flying Eagle” was 
the target (from Honorton et al., 1990, p. 123): 


A black bird. I see a dark shape of a black bird with a very pointed beak with 
his wings down. . . . Almost needle-like beak. . . . Something that would fly or 
is flying . . . like a big parrot with long feathers on a perch. Lots of feathers, 
tail feathers, long, long, long. . . . Flying, a big huge, huge eagle. The wings of 
an eagle spread out. . . . The head of an eagle. White head and dark 
feathers. . . . The bottom of a bird. 


The experimenter monitors the whole procedure and listens to the receiver’s 
monologue. The assistant has one task only, to randomly select the material the 
sender will view on the television screen. This is done by using a computer, and no 
one knows the identity of the “target” selected except the sender. For the particular 
set of experiments we will examine, there were 160 possibilities, of which half were 
static targets and half were dynamic targets. 


Quantifying the Results 


Although it may seem as though the receiver provided an excellent description of the 
Flying Eagle target, remember that the quote given was only a small part of what the 
receiver said. So far, there is no quantifiable result. Such results are secured only at 
the end of the session. 

To provide results that can be analyzed statistically, the results must be expressed 
in terms that can be compared with chance. To provide a comparison to chance, a 
single categorical measure is taken. Three “decoy” targets are chosen from the set. 
They are of the same type as the real target, either static or dynamic. Note that due 
to the random selection, any of the four targets (the real one and the three decoys) 
could equally have been chosen to be the real target at the beginning of the session. 

At the end of the session the receiver is shown the four targets. He or she is then 
asked to decide, on the basis of the monologue provided, which one the sender was 
watching. If the receiver picks the right one, the session is a success. The example 
provided, in which the target was a picture of an eagle, was indeed a success. 


The Null and Alternative Hypotheses 


By chance, one-fourth, or 25%, of the sessions should result in a success. Therefore,
the statistical question for this hypothesis test is: Did the sessions result in signifi- 
cantly more than 25% successes? 

The hypotheses being tested are thus as follows: 


Null hypothesis: There is no extrasensory perception, and the results are due to 
chance guessing. The probability of a successful session is 0.25. 


Alternative hypothesis: The results are not due to chance guessing. The proba- 
bility of a successful session is higher than 0.25. 
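
The null-hypothesis value of 0.25 can be checked with a short simulation. The sketch below is only an illustration, not part of the original experiments: it assumes a receiver who picks one of the four candidates with no information at all, and the function name `simulate_hit_rate` and the seed are hypothetical choices.

```python
import random

def simulate_hit_rate(num_sessions, num_choices=4, seed=0):
    """Fraction of sessions in which a pure guess matches the real target."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(num_sessions):
        target = rng.randrange(num_choices)  # which of the four is the real target
        guess = rng.randrange(num_choices)   # receiver guesses with no information
        if guess == target:
            hits += 1
    return hits / num_sessions

rate = simulate_hit_rate(100_000)
print(f"simulated chance hit rate: {rate:.3f}")
```

With many simulated sessions the hit rate should settle near one-fourth, which is the value the null hypothesis assigns to the probability of a successful session.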


The Results


Honorton and his colleagues, who reported their results in the Journal of Parapsy- 
chology in 1990, ran several experiments, using the setup described, between 1983 
and 1989, when the lab closed. There were a total of 355 sessions, of which 122 were 
successful. 

The sample proportion of successful results is 122/355 = 0.344, or 34.4%. If the 
null hypothesis were true, the true proportion would be 0.25 and the standard devi- 
ation would be: 


standard deviation = square root of [(0.25)(0.75)/355] = 0.023

Therefore, the test statistic is

standardized score = z-score = (0.344 − 0.25)/0.023 = 4.09


It is obvious that such a large standardized score would rarely occur by chance, and 
indeed the p-value is about 0.00005. In other words, if chance alone were operating, 
we would see results of this magnitude about 5 times in every 100,000 such experi- 
ments. Therefore, we would certainly declare this to be a statistically significant 
result. 
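
The computation above can be sketched in a few lines of Python. This is a minimal illustration using only the counts given in the text (355 sessions, 122 successes); the helper name `upper_tail` is hypothetical. Because the code does not round intermediate values, its z-score differs slightly from 4.09. It also computes an exact binomial tail probability for comparison, and the two tail areas may differ somewhat, although both are extremely small.

```python
import math

def upper_tail(z):
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n, successes, p0 = 355, 122, 0.25  # sessions, hits, null-hypothesis proportion

p_hat = successes / n                    # sample proportion
sd_null = math.sqrt(p0 * (1 - p0) / n)   # standard deviation if the null is true
z = (p_hat - p0) / sd_null               # standardized score
p_normal = upper_tail(z)                 # one-sided p-value, normal approximation

# Exact one-sided p-value: chance of 122 or more hits in 355 pure guesses.
p_exact = sum(
    math.comb(n, k) * p0**k * (1 - p0) ** (n - k)
    for k in range(successes, n + 1)
)

print(f"sample proportion = {p_hat:.3f}")
print(f"standard deviation = {sd_null:.3f}")
print(f"z-score = {z:.2f}")
print(f"p-value (normal approximation) = {p_normal:.6f}")
print(f"p-value (exact binomial) = {p_exact:.6f}")
```

Either way, the p-value is far below any conventional level of significance, so the conclusion in the text does not depend on which method is used.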

Carl Sagan has said that “exceptional claims require exceptional proof,” and for
some nonbelievers, even a p-value this small may not increase their personal proba- 
bility that extrasensory perception exists. However, as with any area of science, such 
a striking result should be taken as evidence that something out of the ordinary is 
definitely happening in these experiments. Further experiments have continued to 
achieve similar results (Bem et al., 2002). 


For Those Who Like Formulas 


The formulas for the material in this chapter are presented along with other formu- 
las for testing hypotheses at the end of Chapter 23. 


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. When we revisited Case Study 6.4 in Chapter 21, we learned that a 95% confi- 
dence interval for the difference in years of education for mothers who did not 
smoke compared with those who did extended from 0.15 to 1.19 years, with 
higher education for those who did not smoke. Suppose we had used the data to 
construct a test instead of a confidence interval, to see if one group in the pop- 
ulation was more educated than the other. What would the null and alternative 
hypotheses have been for the test? 


2. Refer to Exercise 1. If we had conducted the hypothesis test, the resulting 
p-value would be 0.01. Explain what the p-value represents for this example. 


*3. Refer to Case Study 21.1, in which women were randomly assigned to receive
either a placebo or calcium and severity of premenstrual syndrome (PMS)
symptoms was measured.


a. What are the null and alternative hypotheses tested in this experiment?


*b. The researchers concluded that calcium helped reduce the severity of PMS
symptoms. Which type of error could they have made?


c. What would be the consequences of making a type 1 error in this experiment?
What would be the consequences of making a type 2 error?


4. The journal article reporting the experiment described in Case Study 21.1 (see
Thys-Jacobs et al., 1998, in Chapter 21) compared the placebo and calcium-treated
groups for a number of PMS symptoms, both before the treatment began
(baseline) and in the third cycle. A p-value was given for each comparison. For
each of the following comparisons, state the null and alternative hypotheses and
the appropriate conclusion:


a. Baseline, mood swings, p-value = 0.484
b. Third cycle, mood swings, p-value = 0.002
c. Third cycle, insomnia, p-value = 0.213


*5. An article in Science News reported on a study to compare treatments for
reducing cocaine use. Part of the results are


short-term psychotherapy that offers cocaine abusers practical strategies 
for maintaining abstinence sparks a marked drop in their overall cocaine 
use. . . . In contrast, brief treatment with desipramine—a drug thought by 
some researchers to reduce cocaine cravings—generates much weaker 
drops in cocaine use. (Bower, 24, 31 December 1994, p. 421) 


*a. The researchers were obviously interested in comparing the rates of cocaine
use following treatment with the two methods. State the null and alternative 
hypotheses for this situation. 

b. Explain what the two types of error could be for this situation and what their 
consequences would be. 

*c. Although no p-value is given, the researchers presumably concluded that the
psychotherapy treatment was superior to the drug treatment. Which type of 
error could they have made? 


6. State the null and alternative hypotheses for each of the following potential
research questions:

a. Does working 5 hours a day or more at a computer contribute to deteriorat- 
ing eyesight? 

b. Does placing babies in an incubator during infancy lead to claustrophobia in 
adult life? 


c. Does placing plants in an office lead to fewer sick days? 


7. For each of the situations in Exercise 6, explain the two errors that could be
made and what the consequences would be.


8. Explain why we can specify the probability of making a type 1 error, given that
the null hypothesis is true, but we cannot specify the probability of making a
type 2 error, given that the alternative hypothesis is true.


9. Compute a 95% confidence interval for the probability of a successful session
in the ganzfeld studies reported in Case Study 22.1.


*10. Specify what a type 1 and a type 2 error would be for the ganzfeld studies
reported in Case Study 22.1.


11. Given the convention of declaring that a result is statistically significant if the
p-value is 0.05 or less, what decision would be made concerning the null and
alternative hypotheses in each of the following cases? Be explicit about the
wording of the decision.


a. p-value = 0.35
b. p-value = 0.04


*12. In previous chapters, we learned that researchers have discovered a link between
vertex baldness and heart attacks in men.


a. State the null hypothesis and the alternative hypothesis used to investigate
whether there is such a relationship.


b. Discuss what would constitute a type 1 error in this study.
*c. Discuss what would constitute a type 2 error in this study.


13. A report in the Davis (CA) Enterprise (6 April 1994, p. A-11) was headlined,
“Highly educated people are less likely to develop Alzheimer’s disease, a new
study suggests.”


a. State the null and alternative hypotheses the researchers would have used in
this study.


b. What do you think the headline is implying about statistical significance?
Restate the headline in terms of statistical significance.


c. Was this a one-sided test or a two-sided test? Explain.


14. Suppose that a study is designed to choose between the hypotheses:


Null hypothesis: Population proportion is 0.25. 

Alternative hypothesis: Population proportion is higher than 0.25. 
On the basis of a sample of size 500, the sample proportion is 0.29. The stan- 
dard deviation for the potential sample proportions in this case is about 0.02. 


a. Compute the standardized score corresponding to the sample proportion of 
0.29, assuming the null hypothesis is true. 


b. What is the percentile for the standardized score computed in part a? 
c. What is the p-value for the test? 


d. Based on the results of parts a to c, make a conclusion. Be explicit about the 
wording of your conclusion and justify your answer. 

e. To compute the standardized score in part a, you assumed the null hypothe- 
sis was true. Explain why you could not compute a standardized score under 
the assumption that the alternative hypothesis was true. 


15. An article in the Los Angeles Times (24 December 1994, p. A16) announced that
a new test for detecting HIV had been approved by the Food and Drug Administration
(FDA). The test requires the person to send a saliva sample to a lab. The
article described the accuracy of the test as follows:


The FDA cautioned that the saliva test may miss one or two infected individuals
per 100 infected people tested, and may also result in false positives
at the same rate in the uninfected. For this reason, the agency
recommended that those who test positive by saliva undergo confirmatory
blood tests to establish true infection.


a. Do you think it would be wise to use this saliva test to screen blood donated
at a blood bank, as long as those who test positive were retested as suggested
by the FDA? Explain your reasoning.


b. Suppose that 10,000 students at a university were all tested with this saliva
test and that, in truth, 100 of them were infected. Further, suppose the false
positive and false negative rates were actually both 1 in 100 for this group.
If someone tests positive, what is the probability that he or she is infected?


*16. Consider medical tests in which the null hypothesis is that the patient does not
have the disease and the alternative hypothesis is that he or she does.


*a. Give an example of a medical situation in which a type 1 error would be
more serious.


*b. Give an example of a medical situation in which a type 2 error would be
more serious.


*17. Many researchers decide to reject the null hypothesis as long as the p-value is
0.05 or less. In a testing situation for which a type 2 error is much more serious
than a type 1 error, should researchers require a higher or a lower p-value in
order to reject the null hypothesis? Explain your reasoning.


18. In Case Study 1.2 and in Chapters 12 and 13, we examined a study showing that
there appears to be a relationship between taking aspirin and incidence of heart
attack. The null hypothesis in that study would be that there is no relationship
between the two variables, and the alternative would be that there is a relationship.
Explain what a type 1 error and a type 2 error would be for the study and
what the consequences of each type would be for the public.


19. In Case Study 1.1, Lee Salk did an experiment to see if hearing the sound of a
human heartbeat would help infants gain weight during the first few days of life. 
By comparing weight gains for two sample groups of infants, he concluded that 
it did. One group listened to a heartbeat and the other did not. 


a. What are the null and alternative hypotheses for this study? 
b. Was this a one-sided test or a two-sided test? Explain. 

c. What would a type 1 and type 2 error be for this study? 

d. Given the conclusion made by Dr. Salk, explain which error he could possibly
have committed and which one he could not have committed.


e. Rather than simply knowing whether there was a difference in average
weight gains for the two groups, what statistical technique would have pro- 
vided additional information? 


*20. In Original Source 17, “Monkeys reject unequal pay,” the researchers found that
female monkeys were less willing to participate in an exchange of a small token
for a piece of cucumber if they noticed that another monkey got a better deal. In 
one part of the experiment, a condition called “effort control,” the monkey wit- 
nessed another monkey being given a grape without having to give up a token 
(or anything else) in return. The first monkey was then cued to perform as usual, 
giving up a small token in return for a piece of cucumber. The proportion of 
times the monkeys were willing to do so was recorded, and the results are shown 
in Figure 1 (page 297) of the report. This “effort control” condition was com- 
pared with the “equality condition” in which the first monkey was also required 
to exchange a token for cucumber. For this part of the study, the researchers re- 
ported, “Despite the small number of subjects, the overall exchange tendency 
varied significantly . . . comparing effort controls with equality tests, P < 0.05” 
(p. 298). (The symbol < means “less than.”)


a. The participants in the study were five capuchin monkeys. To what popula- 
tion do you think the results apply? (Note that there is no right or wrong 
answer.) 


b. The researchers were interested in comparing the proportion of times the 
monkeys would cooperate by trading the token for the cucumber after ob- 
serving another monkey doing the same, versus observing another monkey 
receiving a free grape. The null hypothesis in this test is that the proportion 
of times monkeys in the population would cooperate after observing either 
of the two conditions is the same. What is the alternative hypothesis? 


c. Based on the quote, what do you know about the p-value? 
*d. Based on the quote, what level of significance are the researchers using? 


e. What conclusion would be made? Write your conclusion in statistical lan- 
guage and in the context of the example. 


21. In Original Source 2 on the CD, “Development and initial validation of the
Hangover Symptoms Scale: Prevalence and correlates of hangover symptoms
in college students,” one of the questions was how many times respondents 
had experienced at least one hangover symptom in the past year. Table 3 of the 
paper (page 1445) shows that out of all 1216 respondents, 40% answered that 
it was two or fewer times. Suppose the researchers are interested in the pro- 
portion of the population who would answer that it was two or fewer times. 
Can they conclude that this population proportion is significantly less than 
half (50% or 0.5)? Go through the four steps of hypothesis testing for this 
situation. 


22. Refer to the previous exercise. Repeat the test for women only. Refer to Table 3
on page 1445 of the paper for the data.


Mini-Projects


1. Construct a situation for which you can test null and alternative hypotheses for 
a population proportion. For example, you could see whether you can flip a coin 
in a manner so as to bias it in favor of heads. Or you could conduct an ESP test 
in which you ask someone to guess the suits in a deck of cards. (To do the lat- 
ter experiment properly, you must replace the card and shuffle each time so you 
don’t change the probability of each suit by having only a partial deck, and you 
must separate the “sender” and “receiver” to rule out normal communication.) 
Collect data for a sample of at least size 100. Carry out the test. Make sure you 
follow the four steps given in Section 22.2, and be explicit about your hypothe- 
ses, your decision, and the reasoning behind your decision. 


2. Find two newspaper articles reporting on the results of studies with the follow- 
ing characteristics. First, find one that reports on a study that failed to find a re- 
lationship. Next, find one that reports on a study that did find a relationship. For 
each study, state what hypotheses you think the researchers were trying to test. 
Then explain what you think the results really imply compared with what is im- 
plied in the newspaper reports for each study. 


Hypothesis Testing—
Examples and Case Studies 


Thought Questions 


1. In Chapter 21, we examined a study showing that the difference in sample means
for weight loss based on dieting only versus exercising only was 3.2 kg. That same 
study showed that the difference in average amount of fat weight (as opposed 
to muscle weight) lost was 1.8 kg and that the corresponding standard error was 
0.83 kg. Suppose the means are actually equal, so that the mean difference in fat 
that would be lost for the populations is actually zero. What is the standardized score 
corresponding to the observed difference of 1.8 kg? Would you expect to see a stan- 
dardized score that large or larger very often? 


2. In the journal article reported on in Case Study 6.4, comparing IQs for children of
smokers and nonsmokers, one of the statements made was, “after control for
confounding background variables, the average difference [in mean IQs] observed at 12
and 24 months was 2.59 points (95% CI: −3.03, 8.20; P = 0.37)” (Olds et al., 1994,
p. 223). The reported value of 0.37 is the p-value. What do you think are the null and 
alternative hypotheses being tested? 


3. In chi-square tests for two categorical variables, introduced in Chapter 13, we were
interested in whether a relationship observed in a sample reflected a real relationship
in the population. What are the null and alternative hypotheses? 


4. In Chapter 13, we found a statistically significant relationship between smoking (yes
or no) and time to pregnancy (one cycle or more than one cycle). Explain what the
type 1 and type 2 errors would be for this situation and the consequences of mak- 
ing each type of error. 


23.1 How Hypothesis Tests Are Reported in the News


EXAMPLE 1 


When you read the results of hypothesis tests in a newspaper, you are given very lit- 
tle information about the details of the test. It is therefore important for you to re- 
member the steps occurring behind the scenes, so you can translate what you read into 
the bigger picture. Remember that the basic steps to hypothesis testing are similar in 
any setting; they are stated slightly more concisely here than in Chapters 13 and 22: 


Step 1: Determine the null and alternative hypotheses. 
Step 2: Collect and summarize the data into a test statistic. 
Step 3: Use the test statistic to determine the p-value. 


Step 4: The result is statistically significant if the p-value is less than or 
equal to the level of significance. 


In the presentation of most research results in the media, you are simply told the re- 
sults of step 4, which isn’t usually even presented in statistical language. If the re- 
sults are statistically significant, you are told that a relationship has been found 
between two variables or that a difference has been found between two groups. If the 
results are not statistically significant, you may simply be told that no relationship 
or difference was found. In Chapter 24, we will revisit the problems that can arise if 
you are only told whether a statistically significant difference emerged from the re- 
search, but not told how large the difference was or how many participants there 
were in the study. 


A study, which will be examined in further detail in Chapter 27, found that cranberry 
juice lived up to its popular reputation of preventing bladder infections, at least in older 
women. Here is a newspaper article reporting on the study (Davis [CA] Enterprise, 
9 March 1994, p. A9): 


Chicago (AP) A scientific study has proven what many women have long sus- 
pected: Cranberry juice helps protect against bladder infections. Researchers found 
that elderly women who drank 10 ounces of a juice drink containing cranberry 
juice each day had less than half as many urinary tract infections as those who con- 
sumed a look-alike drink without cranberry juice. The study, which appeared today 
in the Journal of the American Medical Association, was funded by Ocean Spray 
Cranberries, Inc., but the company had no role in the study's design, analysis or in- 
terpretation, JAMA said. “This is the first demonstration that cranberry juice can re- 
duce the presence of bacteria in the urine in humans,” said lead researcher Dr. Jerry 
Avorn, a specialist in medication for the elderly at Harvard Medical School. 


Reading the article, you should be able to determine the null and alternative hy- 
potheses and the conclusion. But there is no way for you to determine the value of the 
test statistic or the p-value. The study is attempting to compare the odds of getting an 
infection for the population of elderly women if they were to follow one of two regimes: 
10 ounces of cranberry juice per day or 10 ounces of a placebo drink. The null hypothesis
is that the odds ratio is 1; that is, the odds are the same for both groups. The alternative
hypothesis is that the odds of infection are higher for the group drinking the
placebo. The article indicates that the odds ratio is under 50%. In fact, the original article
(Avorn et al., 1994) sets it at 42% and reports that the associated p-value is 0.004.
The newspaper article captured the most important aspect of the research, but did not
make it clear that the p-value was extremely low.


23.2 Testing Hypotheses about Proportions and Means 


EXAMPLE 2 


For many situations, performing the computations necessary to find the test statistic 
—and thus the p-value—requires a level of statistical expertise beyond the scope of 
this text. However, for some simple situations—such as those involving a propor- 
tion, a mean, or the difference between two means—you already have all the neces- 
sary tools to understand the steps that need to be taken in doing such computations. 


The Standardized Score and the p-Value 


If the null and alternative hypotheses can be expressed in terms of a population pro- 
portion, mean, or difference between two means, and if the sample sizes are large, 
then the test statistic is simply the standardized score associated with the sample pro- 
portion, mean, or difference between two means. The standardized score is computed 
assuming the null hypothesis is true. The p-value is found from a table of percentiles 
for standardized scores (such as Table 8.1) or with the help of a computer program 
like Excel. It gives us the percentile at which a particular sample would fall if the null 
hypothesis represented the truth. In Chapter 22, we presented the details for how this 
works for one proportion. Rather than providing detailed formulas for the other situ- 
ations, we will reexamine two previous examples to illustrate how this works. These 
examples should help you understand the ideas behind hypothesis testing in general. 
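
As a rough illustration of how a standardized score is turned into a percentile and a p-value, here is a minimal Python sketch. It assumes the standard normal curve is appropriate (large samples), and the function names `normal_cdf` and `p_value` are hypothetical, chosen only for this example.

```python
import math

def normal_cdf(z):
    """Proportion of the standard normal curve that lies below z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def p_value(z, alternative):
    """p-value for a standardized score z.

    alternative is "greater", "less", or "two-sided", matching the
    direction stated in the alternative hypothesis.
    """
    if alternative == "greater":
        return normal_cdf(-z)            # area above z
    if alternative == "less":
        return normal_cdf(z)             # area below z
    if alternative == "two-sided":
        return 2 * normal_cdf(-abs(z))   # both tails
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")

percentile = 100 * normal_cdf(2.17)
print(f"A standardized score of 2.17 is at about the {percentile:.1f}th percentile")
print(f"two-sided p-value for z = 2.17: {p_value(2.17, 'two-sided'):.3f}")
```

The only choice the user has to make is the direction of the alternative hypothesis, which determines whether one tail or both tails of the curve are counted.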


A Two-Sided Hypothesis Test for the Difference in Two Means


Weight Loss for Diet versus Exercise 


In Chapter 21, we found a confidence interval for the mean difference in weight loss for 
men who participated in dieting only versus exercising only for the period of a year. 
However, weight loss occurs in two forms: lost fat and lost muscle. Did the dieters also 
lose more fat than the exercisers? These were the sample results for amount of fat lost 
after 1 year: 


                              Diet Only                 Exercise Only

sample mean                   5.9 kg                    4.1 kg
sample standard deviation     4.1 kg                    3.7 kg
number of participants (n)    42                        47
standard error                SEM₁ = 4.1/√42 = 0.633    SEM₂ = 3.7/√47 = 0.540


measure of uncertainty = square root of [(0.633)² + (0.540)²] = 0.83
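
The standard errors in the table can be reproduced with a few lines of Python. This is just a sketch of the arithmetic shown above; the variable names are hypothetical.

```python
import math

# Sample results for fat lost after 1 year (from the table above).
sd_diet, n_diet = 4.1, 42
sd_exercise, n_exercise = 3.7, 47

# Standard error of each sample mean: standard deviation / square root of n.
sem_diet = sd_diet / math.sqrt(n_diet)
sem_exercise = sd_exercise / math.sqrt(n_exercise)

# Measure of uncertainty for the difference in the two sample means.
se_difference = math.sqrt(sem_diet**2 + sem_exercise**2)

print(f"SEM, diet only = {sem_diet:.3f}")
print(f"SEM, exercise only = {sem_exercise:.3f}")
print(f"standard error of the difference = {se_difference:.2f}")
```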


Here are the steps required to determine if there is a difference in average fat that
would be lost for the two methods if carried out in the population represented by 
these men. 


Step 1. Determine the null and alternative hypotheses. 


Null hypothesis: There is no difference in average fat lost in the population for the 
two methods. The population mean difference is zero. 


Alternative hypothesis: There is a difference in average fat lost in the population for 
the two methods. The population mean difference is not zero. 


Notice that we do not specify which method has higher fat loss in the population. In this 
study, the researchers were simply trying to ascertain whether there was a difference. 
They did not specify in advance which method they thought would lead to higher fat 
loss. 

Remember that when the alternative hypothesis includes a possible difference in ei- 
ther direction, the test is called a two-sided, or two-tailed, hypothesis test. In this 
example, it made sense for the researchers to construct a two-sided test because they 
had no preconceived idea of which method would be more effective for fat loss. They 
were simply interested in knowing whether there was a difference. When the test is two- 
sided, the p-value must account for possible chance differences in both “tails” or both 
directions. In step 3 for this example, our computation of the p-value will take both pos- 
sible directions into account. 


Step 2. Collect and summarize the data into a test statistic. The test statistic is the stan- 
dardized score for the sample value when the null hypothesis is true. If there is no dif- 
ference in the two methods, then the mean population difference is zero. The sample 
value is the observed difference in the two sample means, namely 5.9 − 4.1 = 1.8 kg.
The measure of uncertainty, or standard error for this difference, which we just com- 
puted, is 0.83. Thus, the test statistic is 


standardized score = z-score = (1.8 − 0)/0.83 = 2.17


Step 3. Use the test statistic to determine the p-value. How extreme is this standardized
score? First, we need to define extreme to mean both directions because a standardized
score of −2.17 would have been equally informative against the null
hypothesis. From Table 8.1, we see that 2.17 is between the 98th and 99th percentiles,
at about the 98.5th percentile. Using software, I found that this is exactly right, so the
probability of a result of 2.17 or higher is 0.015. This is also the probability of an extreme
result in the other direction, that is, below −2.17. Thus, the test statistic results in a
p-value of 2(0.015), or 0.03.


Step 4. The result is statistically significant if the p-value is less than or equal to the level 
of significance. 

Using the standard 0.05, the p-value of 0.03 indicates that the result is statistically 
significant. If there were really no difference between dieting and exercise as fat-loss 
methods, we would see such an extreme result only 3% of the time, or 3 times out of 
100. We prefer to believe that the truth does not lie with the null hypothesis. We con- 
clude that there is a statistically significant difference between what the average fat loss 
for the two methods would be in the population. We reject the null hypothesis. We ac- 
cept the alternative hypothesis. In the exercises, you will be given a chance to test 
whether the same thing holds true for lean body mass (muscle) weight loss.
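
Steps 2 through 4 of this example can be sketched in Python as follows. The sketch uses the rounded standard error of 0.83 from the text and the standard normal curve; the variable names and the 0.05 level of significance are the only assumptions.

```python
import math

def normal_cdf(z):
    """Proportion of the standard normal curve that lies below z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

observed_difference = 5.9 - 4.1   # diet mean minus exercise mean, in kg
null_difference = 0.0             # population mean difference under the null
standard_error = 0.83             # measure of uncertainty computed in the text
level_of_significance = 0.05

# Step 2: standardized score, assuming the null hypothesis is true.
z = (observed_difference - null_difference) / standard_error

# Step 3: two-sided p-value counts both tails of the normal curve.
p_two_sided = 2 * normal_cdf(-abs(z))

# Step 4: compare the p-value with the level of significance.
significant = p_two_sided <= level_of_significance

print(f"z-score = {z:.2f}")
print(f"two-sided p-value = {p_two_sided:.3f}")
print("statistically significant" if significant else "not statistically significant")
```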


EXAMPLE 3 


A One-Sided Hypothesis Test for a Proportion


Public Opinion about the President 


Throughout his presidency, Bill Clinton was plagued with questions about the integrity 
of his personal life. On May 16, 1994, Newsweek reported the results of a public opin- 
ion poll that asked: “From everything you know about Bill Clinton, does he have the 
honesty and integrity you expect in a president?” (p. 23). The poll surveyed 518 adults 
and 233, or 0.45 of them (clearly less than half), answered yes. Could Clinton's adver- 
saries conclude from this that only a minority (less than half) of the population of Amer- 
icans thought Clinton had the honesty and integrity to be president? Assume the poll 
took a simple random sample of American adults. (In truth, a multistage sample was 
used, but the results will be similar.) Here is how we would proceed with testing this 
question. 


Step 1. Determine the null and alternative hypotheses. 


Null hypothesis: There is no clear winning opinion on this issue; the proportions who 
would answer yes or no are each 0.50. 


Alternative hypothesis: Fewer than 0.50, or 50%, of the population would answer 
yes to this question. The majority does not think Clinton has the honesty and in- 
tegrity to be president. 


Notice that, unlike the previous example, this alternative hypothesis includes values on 
only one side of the null hypothesis. It does not include the possibility that more than 
50% would answer yes. If the data indicated an uneven split of opinion in that direc- 
tion, we would not reject the null hypothesis because the data would not be support- 
ing the alternative hypothesis, which is the direction of interest. Some researchers 
include the other direction as part of the null hypothesis, but others simply don’t include 
that possibility in either hypothesis, as illustrated in this example. 

Remember that when the alternative hypothesis includes values in one direction 
only, the test is called a one-sided, or one-tailed, hypothesis test. The p-value is 
computed using only the values in the direction specified in the alternative hypothesis, 
as we shall see for this example. 


Step 2. Collect and summarize the data into a test statistic. The test statistic is the stan- 
dardized score for the sample value when the null hypothesis is true. The sample value 
is 0.45. If the null hypothesis is true, the population proportion is 0.50. The corre- 
sponding standard deviation, assuming the null hypothesis is true, is 


standard deviation = square root of [(0.5)(0.5)/518] = 0.022

standardized score = z-score = (0.45 − 0.50)/0.022 = −2.27


Step 3. Use the test statistic to determine the p-value. The p-value is the probability of
observing a standardized score of −2.27 or less, just by chance. From Table 8.1, we find
that −2.27 is between the 1st and 2nd percentiles. In fact, using a more exact method,
the statistical function NORMSDIST(−2.27) in Excel, the p-value is 0.0116. Notice that,
unlike the two-sided hypothesis test, we do not double this value. If we were to find an
extreme value in the other direction, a standardized score of 2.27 or more, we would
not reject the null hypothesis. We are performing a one-sided hypothesis test and are
only concerned with values in one tail of the normal curve when we find the p-value. In
this example, it is the lower tail because those values would indicate that the true
proportion is less than the null hypothesis value of 0.50, thus supporting the alternative
hypothesis.


Step 4. The result is statistically significant if the p-value is less than or equal to the level 
of significance. Using the 0.05 criterion, we have found a statistically significant result. 
Therefore, we could conclude that the proportion of American adults in 1994 who believed
Bill Clinton had the honesty and integrity they expected in a president was significantly
less than a majority.
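
The one-sided test in this example can be sketched in Python. The first part follows the rounded values used in the text; the second part repeats the calculation without rounding, which gives a slightly different standardized score but the same conclusion. The function name `normal_cdf` is a hypothetical stand-in for Excel's NORMSDIST.

```python
import math

def normal_cdf(z):
    """Proportion of the standard normal curve that lies below z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

n, yes_answers, p0 = 518, 233, 0.50

# Following the text: rounded sample proportion and standard deviation.
z_text = round((0.45 - 0.50) / 0.022, 2)
p_text = normal_cdf(z_text)          # lower tail only; not doubled

# Same calculation without rounding the intermediate values.
p_hat = yes_answers / n
sd_null = math.sqrt(p0 * (1 - p0) / n)
z_unrounded = (p_hat - p0) / sd_null
p_unrounded = normal_cdf(z_unrounded)

print(f"z-score as in the text = {z_text:.2f}, p-value = {p_text:.4f}")
print(f"z-score unrounded = {z_unrounded:.2f}, p-value = {p_unrounded:.4f}")
```

Because the alternative hypothesis points to values below 0.50, only the lower tail is used, so the p-value is the area below the standardized score.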


Text not available due to copyright restrictions 


Student’s t-Test


We have already seen how standardized scores can be used to test the null hypothesis
that two population means are equal, which is the situation when comparing HSS
scores for the two groups. Thus, you may wonder why the test statistic in the preceding
quote was given as t rather than the usual standardized score notation of z.
This is a technical issue that will be of little consequence when sample sizes are 
large. When the sample standard deviation is used in place of the population stan- 
dard deviation, the resulting spread of the possible standardized scores (the fre- 
quency curve) is a little bit higher than it is for standard normal scores. This 
discovery was first made in 1908 and reported in a paper authored by “A Student.” 
The author was actually William Gosset, but his employer, the Guinness Brewing 
Company, did not want it to be known that they were using statistical methods. Thus, 
the test based on this standardized score became known as “Student’s t-test.” 
Because the accuracy of the sample standard deviation improves as the sample 
sizes increase, the frequency curve for possible Student’s t values gets very close to 
the standard normal curve for large sample sizes. Therefore, the standard normal 
curve provides p-values that are close to the more exact values derived from the t 
curve. To find the exact p-value for a f statistic requires knowing the sample sizes, 
and using them to find the value of the degrees of freedom, a measure of how spread 
out the f frequency curve is for those sample sizes. For instance, in Example 2 of this 
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chapter the z-test statistic given as 2.17 is actually a t-test statistic, because the sam- 
ple standard deviations were used in its computation. (Population standard devia- 
tions are rarely known, as was the case here.) The appropriate value for the degrees 
of freedom (abbreviated by df) is (42 + 47 — 2) = 87. The p-value given in the ex- 
ample is 2(0.015) = 0.03. The exact p-value can be found using Excel or other soft- 
ware. The Excel command TDIST(2.17,87,2) asks for the p-value for a test statistic 
of t = 2.17, df = 87, and a two-tailed test. The result is 0.0327, very close to the 
p-value of 0.0300 found using the standard normal curve. 
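
The comparison between the t curve and the normal curve can be sketched without Excel. Python's standard library has no t-distribution function, so the sketch below finds the area under the t frequency curve numerically with Simpson's rule. The helper names `t_density` and `two_tailed_t_p_value` are our own, not part of any library.

```python
import math
from statistics import NormalDist


def t_density(x: float, df: int) -> float:
    """Height of the Student's t frequency curve at x."""
    log_constant = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_constant - (df + 1) / 2 * math.log(1 + x * x / df))


def two_tailed_t_p_value(t: float, df: int, steps: int = 2000) -> float:
    """Two-tailed p-value: twice the area under the t curve beyond |t|.

    The area between 0 and |t| is found with Simpson's rule (steps must be even).
    """
    upper = abs(t)
    width = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 == 1 else 2
        total += weight * t_density(i * width, df)
    area_zero_to_t = total * width / 3
    return 2 * (0.5 - area_zero_to_t)


t_statistic = 2.17
df = 42 + 47 - 2  # degrees of freedom from the two sample sizes

p_from_t = two_tailed_t_p_value(t_statistic, df)
p_from_normal = 2 * (1 - NormalDist().cdf(t_statistic))

print(f"df = {df}")
print(f"two-tailed p-value from t curve:      {p_from_t:.4f}")
print(f"two-tailed p-value from normal curve: {p_from_normal:.4f}")
```

The two printed p-values should come out close to the 0.0327 and 0.0300 quoted above, which shows how little the choice of curve matters with samples of this size.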


F-Tests

When more than two means are to be compared, the most common method used is
called analysis of variance, abbreviated as ANOVA. Test statistics are used to test the
null hypothesis that a set of means are all equal versus the alternative hypothesis that
at least one mean differs from the others. The letter F is used to denote the test statistic
in this situation, and the resulting frequency curve is called an F-distribution.
This distribution has two different degrees of freedom values associated with it.
Finding degrees of freedom and p-values in this situation is beyond the level of our
discussion, but you should now be able to interpret the results when you read them.


Text not available due to copyright restrictions 


Other Tests 

We have already discussed chi-square tests in Chapter 13. The symbol for a chi-square
test statistic is the Greek letter “chi” with a square, written as χ². Other tests
you may encounter are a test for whether or not a correlation is significantly different
from 0, in which case the test statistic is just the correlation r; nonparametric
tests such as the sign test, Wilcoxon test, or Mann-Whitney test; and tests for specific
parts of a regression equation. In most cases you will simply need to understand
what the hypotheses of interest were, and how to interpret the reported p-value.
Thus, it’s very important that you understand the generic meaning of a p-value.


23.3 Revisiting Case Studies: How Journals Present Hypothesis Tests


Whereas newspapers and magazines tend to simply report the decision from hypothesis
testing, as noted, journals tend to report p-values as well. This practice allows
you to make your own decision, based on the severity of a type 1 error and the
magnitude of the p-value. Newspaper reports leave you little choice but to accept the
convention that the probability of a type 1 error is 5%.

To demonstrate the way in which journals report the results of hypothesis tests,
we return to some of our earlier case studies.


CASE STUDY 6.1 REVISITED
Mozart, Relaxation, and Performance on Spatial Tasks


This case study reported that listening to Mozart for 10 minutes appeared to increase 
results on the spatial reasoning part of an IQ test. There were three listening condi- 
tions—Mozart, a relaxation tape, and silence—and all subjects participated in all 
three conditions. 

Because three means were being compared, the appropriate test statistic to sum- 
marize the data is more complicated than those we have discussed. However, devel- 
oping the hypotheses and interpreting the p-value are basically the same. Here are 
the hypotheses: 


Null hypothesis: There are no differences in population mean spatial reasoning 
IQ scores after each of the three listening conditions. 


Alternative hypothesis: Population mean spatial reasoning IQ scores do differ 
for at least one of the conditions compared with the others. 


Notice that the researchers did not specify in advance which condition might result 
in higher IQ scores. 
Here is how the results were reported: 


A one-factor (listening condition) repeated measures analysis of variance . . .
revealed that subjects performed better on the abstract/spatial reasoning tests
after listening to Mozart than after listening to either the relaxation tape or to
nothing (F[2,35] = 7.08, p = 0.002). (Rauscher et al., 14 October 1993, p. 611)


Because the p-value for this test is only 0.002, we can clearly reject the null hy- 
pothesis, accept the alternative, and conclude that at least one of the means differs 
from the others. If there were no population differences, sample mean results would 
vary as much as the ones in this sample did, or more, only 2 times in 1000 (0.002). 

The researchers then reported the results, indicating that it was indeed the 
Mozart condition that resulted in higher scores than the others: 


The music condition differed significantly from both the relaxation and silence
conditions (Scheffé’s t = 3.41, p = 0.002; t = 3.67, p = 0.0008, two-tailed,
respectively). The relaxation and silence conditions did not differ (t = 0.795,
p = 0.432, two-tailed). (Rauscher et al., 14 October 1993, p. 611)


Notice that three separate tests are being reported in this paragraph. Each pair
of conditions has been compared. Significant differences, as determined by the extremely
small p-values, were found between the music and relaxation conditions
(p-value = 0.002) and between the music and silence conditions (p-value = 0.0008).
The difference between the relaxation and silence conditions, however, was not
statistically significant (p-value = 0.432).


CASE STUDY 5.1 REVISITED
Quitting Smoking with Nicotine Patches


This study compared the smoking cessation rates for smokers randomly assigned to
use a nicotine patch versus a placebo patch. In the summary at the beginning of the
journal article, the results were reported as follows:


Higher smoking cessation rates were observed in the active nicotine patch
group at 8 weeks (46.7% vs 20%) (P < .001) and at 1 year (27.5% vs 14.2%)
(P = .011). (Hurt et al., 1994, p. 595)


Two sets of hypotheses are being tested, one for the results after 8 weeks and one for
the results after 1 year. In both cases, the hypotheses are


Null hypothesis: The proportions of smokers in the population who would quit
smoking using a nicotine patch and using a placebo patch are the same.


Alternative hypothesis: The proportion of smokers in the population who
would quit smoking using a nicotine patch is higher than the proportion who
would quit using a placebo patch.


In both cases, the reported p-values are quite small: less than 0.001 for the difference
after 8 weeks and equal to 0.011 for the difference after a year. Therefore, we would
conclude that rates of quitting are significantly higher using a nicotine patch than using
a placebo patch after 8 weeks and after 1 year.


CASE STUDY 6.4 REVISITED
Smoking During Pregnancy and Child’s IQ


In this study, researchers investigated the impact of maternal smoking on the subsequent
IQ of the child at 1, 2, 3, and 4 years of age. Earlier, we reported some of the
confidence intervals provided in the journal article reporting the study. Those confi- 
dence intervals were actually accompanied by p-values. Here is the complete re- 
porting of some of the results: 


Children born to women who smoked 10+ cigarettes per day during pregnancy 
had developmental quotients at 12 and 24 months of age that were 6.97 points 
lower (averaged across these two time points) than children born to women 
who did not smoke during pregnancy (95% CI: 1.62,12.31, P = .01); at 36 and 
48 months they were 9.44 points lower (95% CI: 4.52, 14.35, P = .0002). 
(Olds et al., 1994, p. 223) 
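
The intervals and p-values in this quote can be checked against each other. The sketch below assumes that each 95% confidence interval was built as the estimate plus or minus about 1.96 standard errors and that the test is two-sided and based on the normal curve; the journal's exact method may differ (for example, a t curve or an adjusted model), so treat the results as approximations. The function name is our own.

```python
from statistics import NormalDist

normal = NormalDist()
# Multiplier for a 95% interval (about 1.96), assuming a normal curve applies.
multiplier = normal.inv_cdf(0.975)


def two_sided_p_from_ci(estimate: float, lower: float, upper: float) -> float:
    """Approximate two-sided p-value implied by an estimate and its 95% CI."""
    standard_error = (upper - lower) / (2 * multiplier)
    z = estimate / standard_error  # null value for the difference is 0
    return 2 * (1 - normal.cdf(abs(z)))


# Differences and intervals quoted from Olds et al. (1994).
p_early = two_sided_p_from_ci(6.97, 1.62, 12.31)  # 12 and 24 months
p_late = two_sided_p_from_ci(9.44, 4.52, 14.35)   # 36 and 48 months

print(f"12 and 24 months: p is about {p_early:.3f}")
print(f"36 and 48 months: p is about {p_late:.4f}")
```

Under these assumptions the implied p-values land near the reported .01 and .0002, so the interval and the test tell a consistent story. The interval adds the size and direction of the difference.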


Notice that we are given more information in this report than in most because we are
given both confidence interval and hypothesis testing results. This is excellent reporting
because, with this information, we can determine the magnitude of the observed
effects instead of just whether they are statistically significant or not.

Again, two sets of null and alternative hypotheses are being tested in the report,
one set at 12 and 24 months (1 and 2 years) and another at 36 and 48 months (3 and
4 years of age). In both cases, the hypotheses are


Null hypothesis: The mean IQ scores for children whose mothers smoke 10 or
more cigarettes a day during pregnancy are the same as the mean for those
whose mothers do not smoke, in populations similar to the one from which this
sample was drawn.


Alternative hypothesis: The mean IQ scores for children whose mothers smoke
10 or more cigarettes a day during pregnancy are not the same as the mean for
those whose mothers do not smoke, in populations similar to the one from
which this sample was drawn.


This is a two-tailed test because the researchers included the possibility that the
mean IQ score could actually be higher for those whose mothers smoke. The confidence
interval provides the evidence of the direction in which the difference falls.
The p-value simply tells us that there is a statistically significant difference.


CASE STUDY 23.1
An Interpretation of a p-Value Not Fit to Print


In an article entitled, “Probability experts may decide Pennsylvania vote,” the New 
York Times (Passell, 11 April 1994, p. A15) reported on the use of statistics to try to 
decide whether there had been fraud in a special election held in Philadelphia. Un- 
fortunately, the newspaper account made a common mistake, misinterpreting a 
p-value to be the probability that the results could be explained by chance. The con- 
sequence was that readers who did not know how to spot the error would have been 
led to think that the election probably was a fraud. 

It all started with the death of a state senator from Pennsylvania’s Second Sen- 
ate District. A special election was held to fill the seat until the end of the unexpired 
term. The Republican candidate, Bruce Marks, beat the Democratic candidate, 
William Stinson, in the voting booth but lost the election because Stinson received 
so many more votes in absentee ballots. The results in the voting booth were very
close, with 19,691 votes for Mr. Marks and 19,127 votes for Mr. Stinson. But the ab- 
sentee ballots were not at all close, with only 366 votes for Mr. Marks and 1391 
votes for Mr. Stinson. 

The Republicans charged that the election was fraudulent and asked that the 
courts examine whether the absentee ballot votes could be discounted on the basis 
of suspicion of fraud. In February 1994, 3 months after the election, Philadelphia 
Federal District Court Judge Clarence Newcomer disqualified all absentee ballots 
and overturned the election. The ruling was appealed, and statisticians were hired to 
help sort out what might have happened. 

One of the statistical experts, Orley Ashenfelter, decided to examine previous sen- 
atorial elections in Philadelphia to determine the relationship between votes cast in the 
voting booth and those cast by absentee ballot. He computed the difference between 
the Republican and Democratic votes for those who voted in the voting booth, and 
then for those who voted by absentee ballot. He found there was a positive correlation
between the voting booth difference and the absentee ballot difference. Using data
from 21 previous elections, he calculated a regression equation to predict one from the 
other. Using his equation, the difference in votes for the two parties by absentee bal- 
lot could be predicted from knowing the difference in votes in the voting booth. 

Ashenfelter then used his equation to predict what should have happened in the 
special election in dispute. There was a difference of 19,691 - 19,127 = 564 votes
(in favor of the Republicans) in the voting booth. From that, he predicted a differ- 
ence of 133 votes in favor of the Republicans in absentee ballots. Instead, a differ- 
ence of 1025 votes in favor of the Democrats was observed in the absentee ballots 
of the disputed election. 
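
The vote arithmetic in this case study is easy to lose track of, so here is a short sketch that recomputes it from the counts given above. The predicted difference of 133 is taken as reported; the regression equation itself is not given in the text. The last quantity, the shift needed to erase the voting booth margin, is our reading of the 697-vote figure that appears in the newspaper account.

```python
# Vote counts quoted in the case study.
booth_marks, booth_stinson = 19_691, 19_127      # voting booth
absentee_marks, absentee_stinson = 366, 1_391    # absentee ballots

# Differences are Republican minus Democratic, so positive favors Marks.
booth_difference = booth_marks - booth_stinson
observed_absentee_difference = absentee_marks - absentee_stinson

# Prediction from Ashenfelter's regression equation, as reported in the text.
predicted_absentee_difference = 133

# How far the observed absentee result fell from the prediction.
swing_observed = predicted_absentee_difference - observed_absentee_difference

# Smallest shift from the prediction that would erase the voting booth margin.
swing_needed = predicted_absentee_difference + booth_difference

print(booth_difference, observed_absentee_difference, swing_observed, swing_needed)
# → 564 -1025 1158 697
```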

Of course, everyone knows that chance events play a role in determining who 
votes in any given election. So Ashenfelter decided to set up and test two hypothe- 
ses. The null hypothesis was that, given past elections as a guide and given the vot- 
ing booth difference, the overall difference observed in this election could be 
explained by chance. The alternative hypothesis was that something other than 
chance influenced the voting results in this election. 

Ashenfelter reported that if chance alone was responsible, there was a 6% chance 
of observing results as extreme as the ones observed in this election, given the vot- 
ing booth difference. In other words, the p-value associated with his test was about 
6%. That is not how the result was reported in the New York Times. When you read 
its report, see if you can detect the mistake in interpretation: 


There is some chance that random variations alone could explain a 1,158-vote 
swing in the 1993 contest—the difference between the predicted 133-vote Re- 
publican advantage and the 1,025-Democratic edge that was reported. More to 
the point, there is some larger probability that chance alone would lead to a 
sufficiently large Democratic edge on the absentee ballots to overcome the Re- 
publican margin on the machine balloting. And the probability of such a swing 
of 697 votes from the expected results, Professor Ashenfelter calculates, was 
about 6 percent. Putting it another way, if past elections are a reliable guide to 
current voting behavior, there is a 94 percent chance that irregularities in the 
absentee ballots, not chance alone, swung the election to the Democrat, Profes- 
sor Ashenfelter concludes. (Passell, 11 April 1994, p. A15; emphasis added) 


The author of this article has mistakenly interpreted the p-value to be the proba- 
bility that the null hypothesis is true and has thus reported what he thought to be the 
probability that the alternative hypothesis was true. We hope you realize that this is 
not a valid conclusion. The p-value can only tell us the probability of observing these 
results if the election was not fraudulent. It cannot tell us the probability in the other 
direction—namely, the probability that the election was fraudulent based on 
observed results. This is akin to the “confusion of the inverse” discussed in Chapter 
18. There we saw that physicians sometimes confuse the (unknown) probability that 
the patient has a disease, given a positive test, with the (known) probability of a pos- 
itive test, given that the patient has the disease. 
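
A small calculation shows why the 6% figure cannot simply be turned into a 94% chance of irregularities. Going "in the other direction" requires two numbers the study does not provide: how likely irregularities were before looking at the data, and how likely such results would be if irregularities had occurred. Both are hypothetical in the sketch below, and treating the p-value as the probability of the results under chance is itself a simplification made only for illustration.

```python
# Probability of results at least this extreme if chance alone is at work
# (the p-value reported in the case study).
p_results_given_chance = 0.06

# HYPOTHETICAL: probability of such results if irregularities occurred.
# The case study does not supply this number.
p_results_given_irregularities = 0.50


def probability_of_irregularities(prior: float) -> float:
    """Bayes' rule: P(irregularities | results) for a given prior probability."""
    numerator = prior * p_results_given_irregularities
    denominator = numerator + (1 - prior) * p_results_given_chance
    return numerator / denominator


# HYPOTHETICAL prior probabilities of irregularities before seeing the data.
for prior in (0.01, 0.10, 0.50):
    posterior = probability_of_irregularities(prior)
    print(f"prior = {prior:.2f} -> posterior = {posterior:.2f}")
```

With these made-up inputs the answer ranges from under 10% to nearly 90%, depending entirely on the prior. None of the results equals 94%, which is the point: the p-value alone does not determine the probability that the election was fraudulent.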

You should also realize that the implication that the results of past elections 
would hold in this special election may not be correct. This point was raised by another
statistician involved with the case. The New York Times report notes:


Paul Shaman, a professor of statistics at the Wharton School at University of
Pennsylvania . . . exploits the limits in Professor Ashenfelter’s reasoning. Rela- 
tionships between machine and absentee voting that held in the past, he ar- 
gues, need not hold in the present. Could not the difference, he asks, be 
explained by Mr. Stinson’s “engaging in aggressive efforts to obtain absentee 
votes?” (Passell, 11 April 1994, p. A15) 


The case went to court two more times, but the original decision made by Judge 
Newcomer was upheld each time. The Republican Bruce Marks held the seat until 
December 1994. As a footnote, in the regular election in November 1994, Bruce 
Marks lost by 393 votes to Christina Tartaglione, the daughter of the chair of the 
board of elections, one of the people allegedly involved in the suspected fraud. This 
time, both candidates agreed that the election had been conducted fairly (Shaman, 
28 November 1994). 


For Those Who Like Formulas 


Some Notation for Hypothesis Tests
The null hypothesis is denoted by $H_0$, and the alternative hypothesis is denoted
by $H_1$ or $H_a$.

“alpha” = $\alpha$ = level of significance = desired probability of making a type 1
error when $H_0$ is true; we reject $H_0$ if p-value $\le \alpha$.

“beta” = $\beta$ = probability of making a type 2 error when $H_a$ is true; power = $1 - \beta$


Steps for Testing the Mean of a Single Population 


Denote the population mean by $\mu$ and the sample mean and standard deviation by
$\bar{X}$ and $s$, respectively.


Step 1. $H_0$: $\mu = \mu_0$, where $\mu_0$ is the chance or status quo value.
$H_a$: $\mu \ne \mu_0$ for a two-sided test; $H_a$: $\mu < \mu_0$ or $H_a$: $\mu > \mu_0$ for a one-sided test, with
the direction determined by the research hypothesis of interest.


Step 2. This test statistic applies only if the sample is large. The test statistic is

$$z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$


Step 3. The p-value depends on the form of $H_a$. In each case, we refer to the proportion
of the standard normal curve above (or below) a value as the “area” above
(or below) that value. Then we list the p-values as follows:


Alternative Hypothesis        p-Value

$H_a$: $\mu \ne \mu_0$        2 × area above $|z|$
$H_a$: $\mu > \mu_0$          area above $z$
$H_a$: $\mu < \mu_0$          area below $z$


Step 4. You must specify the desired $\alpha$; it is commonly 0.05. Reject $H_0$ if
p-value $\le \alpha$.
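
The four steps for a single mean can be sketched in a few lines of Python. The function name and the data are hypothetical, chosen only to show how the three forms of the alternative hypothesis map onto areas under the standard normal curve.

```python
import math
from statistics import NormalDist


def one_mean_z_test(sample_mean, sample_sd, n, mu_0, alternative="two-sided"):
    """Return (z, p-value) for a large-sample test of a single mean.

    alternative is "two-sided", "greater" (mu > mu_0), or "less" (mu < mu_0).
    """
    z = (sample_mean - mu_0) / (sample_sd / math.sqrt(n))
    normal = NormalDist()
    if alternative == "two-sided":
        p_value = 2 * (1 - normal.cdf(abs(z)))  # 2 x area above |z|
    elif alternative == "greater":
        p_value = 1 - normal.cdf(z)             # area above z
    elif alternative == "less":
        p_value = normal.cdf(z)                 # area below z
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return z, p_value


# HYPOTHETICAL data: n = 100, sample mean 103, sample sd 15, null value 100.
z, p_value = one_mean_z_test(103, 15, 100, mu_0=100)
alpha = 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {p_value <= alpha}")
```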


Steps for Testing a Proportion for a Single Population


Steps 1, 3, and 4 are the same, except replace $\mu$ with the population proportion $p$ and
$\mu_0$ with the hypothesized proportion $p_0$. The test statistic (step 2) is

$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$$
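
As a rough illustration of the proportion formula, the sketch below uses the sample proportion (0.45) and null value (0.50) from the one-sided example earlier in the chapter. The sample size of 500 is a hypothetical stand-in, so the resulting z differs slightly from the -2.27 computed there.

```python
import math
from statistics import NormalDist


def one_proportion_z(p_hat: float, p_0: float, n: int) -> float:
    """Standardized score for a sample proportion under the null value p_0."""
    return (p_hat - p_0) / math.sqrt(p_0 * (1 - p_0) / n)


# Sample proportion and null value from the example earlier in the chapter;
# n = 500 is a HYPOTHETICAL sample size used only for illustration.
z = one_proportion_z(p_hat=0.45, p_0=0.50, n=500)

# One-sided test with alternative p < p_0: area below z.
p_value = NormalDist().cdf(z)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
```

Note that the standard deviation in the denominator uses the null value $p_0$, not the sample proportion, because the calculation assumes the null hypothesis is true.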


Steps for Testing for Equality of Two Population
Means Using Large Independent Samples


Steps 1, 3, and 4 are the same, except replace $\mu$ with $(\mu_1 - \mu_2)$ and $\mu_0$ with 0.
Use previous notation for sample sizes, means, and standard deviations; the test
statistic (step 2) is

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
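
The two-sample statistic can be sketched the same way. The sample sizes (42 dieters, 47 exercisers) come from this chapter, but the means and standard deviations below are illustrative assumptions, chosen so that the difference is 1.8 kg and the standard error is about 0.83 kg as quoted for Example 2. They should not be read as the study's reported data.

```python
import math
from statistics import NormalDist


def two_means_z(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Return (standard error, z) for the difference in two sample means."""
    standard_error = math.sqrt(sd_1**2 / n_1 + sd_2**2 / n_2)
    return standard_error, (mean_1 - mean_2) / standard_error


# Sample sizes 42 (dieters) and 47 (exercisers) come from the chapter.
# The means and standard deviations are ILLUSTRATIVE ASSUMPTIONS chosen so that
# the difference is 1.8 kg and the standard error is about 0.83 kg.
standard_error, z = two_means_z(5.9, 4.1, 42, 4.1, 3.7, 47)

p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test
print(f"standard error = {standard_error:.2f}, z = {z:.2f}, p-value = {p_value:.3f}")
```

With these inputs the standardized score comes out close to the 2.17 used in the chapter, and the two-sided p-value is close to 0.03.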


Exercises

Asterisked (*) exercises are included in the Solutions at the back of the book.


1. In Exercise 12 in Chapter 20, we learned that in a survey of 507 adult American 
Catholics, 59% answered yes to the question, “Do you favor allowing women to 
be priests?” 

a. Set up the null and alternative hypotheses for deciding whether a majority of 
American Catholics favors allowing women to be priests. 

b. Using Example 3 in this chapter as a guide, compute the test statistic for this 
situation. 

c. If you have done everything correctly, the p-value for the test is about 
0.00005. Based on this, make a conclusion for this situation. Write it in both 
statistical language and in words that someone with no training in statistics 
would understand. 

*2. Refer to Exercise 1. Is the test described there a one-sided or a two-sided test? 

3. Suppose a one-sided test for a proportion resulted in a p-value of 0.03. What 
would the p-value be if the test were two-sided instead? 

*4. Suppose a two-sided test for a difference in two means resulted in a p-value
of 0.08.

*a. Using the usual criterion for hypothesis testing, would we conclude that
there was a difference in the population means? Explain.

*b. Suppose the test had been constructed as a one-sided test instead, and the evidence
in the sample means was in the direction to support the alternative hypothesis.
Using the usual criterion for hypothesis testing, would we be able
to conclude that there was a difference in the population means? Explain.

5. Suppose you were given a hypothesized population mean, a sample mean, a
sample standard deviation, and a sample size for a study involving a random
sample from one population. What would you use as the test statistic?

6. In Example 3, we showed that the Excel command NORMSDIST(z) gives the
area below the standardized score z. Use Excel or another computer or calculator
function to find the p-value for each of the following examples and case studies,
taking into account whether the test is one-sided or two-sided:

a. Chapter 23, Example 2, z = 2.17, two-sided test
b. Chapter 22, Example 1, z = 2.00, one-sided test
c. Case Study 22.1, z = 4.09, one-sided test

*7. Suppose you wanted to see whether a training program helped raise students’
scores on a standardized test. You administer the test to a random sample of
students, give them the training program, then readminister the test. For each
student, you record the increase (or decrease) in the test score from one time to
the next.

*a. What would the null and alternative hypotheses be for this situation?

*b. Suppose the mean change for the sample was 10 points and the accompanying
standard error was 4 points. What would be the standardized score that
corresponded to the sample mean of 10 points?

*c. Based on the information in part b, what would you conclude about this
situation?

d. What explanation might be given for the increased scores, other than the fact
that the training program had an impact?

e. What would have been a better way to design the study in order to rule out
the explanation you gave in part d?

8. Refer to Example 2 in Chapter 13, in which we tested whether there was a relationship
between gender and driving after drinking alcohol. Remember that the
Supreme Court used the data to determine whether a law was justified. The law
differentiated between the ages at which young males and young females could
purchase 3.2% beer. Specify what a type 1 and a type 2 error would be for this
example. Explain what the consequences of the two types of error would be in
that context.

9. On July 1, 1994, The Press of Atlantic City, NJ, had a headline reading, “Study:
Female hormone makes mind keener” (p. A2). Here is part of the report:


Halbreich said he tested 36 post-menopausal women before and after they
started the estrogen therapy. He gave each one a battery of tests that measured
such things as memory, hand-eye coordination, reflexes and the ability
to learn new information and apply it to a problem. After estrogen
therapy started, he said, there was a subtle but statistically significant increase
in the mental scores of the patients.

Explain what you learned about the study, in the context of the material in this
chapter, by reading this quote. Be sure to specify the hypotheses that were being
tested and what you know about the statistical results.

*10. Siegel (1993) reported a study in which she measured the effect of pet ownership
on the use of medical care by the elderly. She interviewed 938 elderly
adults. One of her results was reported as: “After demographics and health status
were controlled for, subjects with pets had fewer total doctor contacts during
the one-year period than those without pets (beta = -.07, p < .05)”
(p. 164).

a. State the null and alternative hypotheses Siegel was testing. Be careful to distinguish
between a population and a sample.

*b. State the conclusion you would make. Be explicit about the wording.

11. Refer to Exercise 10. Here is another of the results reported by Siegel: “For subjects
without a pet, having an above-average number of stressful life events resulted
in about two more doctor contacts during the study year (10.37 vs. 8.38,
p < .005). In contrast, the number of stressful life events was not significantly
related to doctor visits among subjects with a pet” (1993, p. 164).

a. State the null and alternative hypotheses Siegel is testing in this passage. Notice
that two tests are being performed; be sure to cover both.

b. Pretend you are a newspaper reporter, and write a short story describing the
results reported in this exercise. Be sure you do not convey any misleading
information. You are writing for a general audience, so do not use statistical
jargon that would be unfamiliar to them.

12. In Example 2 in this chapter, we tested whether the average fat lost from 1 year
of dieting versus 1 year of exercise was equivalent. The study also measured
lean body weight (muscle) lost or gained. The average for the 47 men who exercised
was a gain of 0.1 kg, which can be thought of as a loss of -0.1 kg. The
standard deviation was 2.2 kg. For the 42 men in the dieting group, there was an
average loss of 1.3 kg, with a standard deviation of 2.6 kg. Test to see whether
the average lean body mass lost (or gained) would be different for the population.
Specify all four steps of your hypothesis test.

13. Professors and other researchers use scholarly journals to publish the results of
their research. However, only a small minority of the submitted papers is accepted
for publication by the most prestigious journals. In many academic
fields, there is a debate as to whether submitted papers written by women
are treated as well as those submitted by men. In the January 1994 issue of
European Science Editing (Maisonneuve, January 1994), there was a report on
a study that examined this question. Here is part of that report:

Similarly, no bias was found to exist at JAMA [Journal of the American
Medical Association] in acceptance rates based on the gender of the corresponding
author and the assigned editor. In the sample of 1,851 articles
considered in this study female editors used female reviewers more often
than did male editors (P < 0.001).

That quote actually contains the results of two separate hypothesis tests. Explain
what the two sets of hypotheses tested are and what you can conclude about the
p-value for each set.

14. On January 30, 1995, Time magazine reported the results of a poll of adult
Americans, in which they were asked, “Have you ever driven a car when you
probably had too much alcohol to drive safely?” The exact results were not
given, but from the information provided we can guess at what they were. Of the
300 men who answered, 189 (63%) said yes and 108 (36%) said no. The remaining
3 weren’t sure. Of the 300 women, 87 (29%) said yes while 210 (70%)
said no, and the remaining 3 weren’t sure.

a. Ignoring those who said they weren’t sure, there were 297 men asked, and
189 said yes, they had driven a car when they probably had too much alcohol.
Does this provide statistically significant evidence that a majority of
men in the population (that is, more than half) would say that they had
driven a car when they probably had too much alcohol, if asked? Go through
the four steps to test this hypothesis.

b. For the test in part a, you were instructed to perform a one-sided test. Why
do you think it would make sense to do so in this situation? If you do not
think it made sense, explain why not.

c. Repeat parts a and b for the women. (Note that of the 297 women who answered,
87 said yes.)

*15. In Example 2 in this chapter, we tested to see whether dieters and exercisers had
significantly different average fat loss. We concluded that they did because the 
difference for the samples under consideration was 1.8 kg, with a measure of un- 
certainty of 0.83 kg and a standardized score of 2.17. Fat loss was higher for the 
dieters. 


a. Construct a 95% confidence interval for the population difference in mean 
fat loss. Consider the two different methods for presenting results: (1) the 
p-value and conclusion from the hypothesis test or (2) the confidence inter- 
val. Which do you think is more informative? Explain. 


*b. Suppose the alternative hypothesis had been that men who exercised lost 
more fat on average than men who dieted. Would the null hypothesis have 
been rejected? Explain why or why not. If yes, give the p-value that would 
have been used. 


c. Suppose the alternative hypothesis had been that men who dieted lost more 
fat on average than men who exercised. Would the null hypothesis have been 
rejected? Explain why or why not. If yes, give the p-value that would have 
been used.


Exercises 16 to 19 refer to News Story and Original Source 1, “Alterations in Brain
and Immune Function Produced by Mindfulness Meditation,” on the CD. For this
study, volunteers were randomly assigned to a meditation group (25 participants) or
a control group (16 participants). The meditation group received an 8-week medita- 
tion training program. Both groups were given influenza shots and their antibodies 
were measured. This measurement was taken after the meditation group had been 
practicing meditation for about 8 weeks. The researchers also measured brain ac- 
tivity, but these exercises will not explore that part of the study. 


16. To what population do you think the results of this study apply? If you need 
more information to answer this question, you will find a description of the par- 
ticipants in Original Source 1, page 565. 


17. Participants were given psychological tests measuring positive and negative affect
(mood) as well as anxiety at three time periods. Time 1 was before the meditation
training. Time 2 was at the end of the 8 weeks of training, and Time 3
was 4 months later. One of the results reported in the paper is

There was a significant decrease in trait negative affect with the meditators
showing less negative affect at Times 2 and 3 compared with their
negative affect at Time 1 [t(20) = 2.27 and t(21) = 2.45, respectively,
p < .05 for both]. Subjects in the control group showed no change over
time in negative affect (t < 1). (Davidson et al., p. 565)

a. The first sentence of the quote reports the results of two hypothesis tests for
the meditators. Specify in words the null hypothesis for each of the two tests.
They are the same except that one is for Time 2 and one is for Time 3. State
each one separately, referring to what the time periods were. Make sure you
don’t confuse the population with the sample and that you state the hypotheses
using the correct one.

b. Refer to part a. State the alternative hypothesis for each test. Explain whether
you decided to use a one-sided or a two-sided test and why.

c. Write the conclusion for the Time 2 test in statistical language and in plain
English.

18. Refer to the quote in the previous exercise. Explain what is meant by the last 
sentence. 


*19. Another quote in the article is

In response to the influenza vaccine, the meditators displayed a significantly
greater rise in antibody titers from the 4 to 8 week blood draw compared
with the controls [t(33) = 2.05, p < .05, Figure 5]. (Davidson et al.,
p. 566)

*a. Specify in words the null hypothesis being tested. Make sure you don’t confuse
the population with the sample and that you state the hypothesis using
the correct one.

b. Specify in words the alternative hypothesis being tested. Explain whether
you decided to use a one-sided or a two-sided test and why.

c. What is the meaning of the word significantly in the quote?

d. Explain in plain English what the results of this test mean.




Mini-Projects 


1. Find three separate journal articles that report the results of hypothesis tests. For 
each one, do or answer the following: 


a. State the null and alternative hypotheses. 
b. Based on the p-value reported, what conclusion would you make? 
c. What would a type 1 and a type 2 error be for the hypotheses being tested?


2. Conduct a test for extrasensory perception. You can either create target pictures 
or use a deck of cards and ask people to guess suits or colors. Whatever you use, 
be sure to randomize properly. For example, with a deck of cards you should al- 
ways replace the previous target and shuffle very well. 


a. Explain how you conducted the experiment.
b. State the null and alternative hypotheses for your experiment.
c. Report your results.
d. If you do not find a statistically significant result, can you conclude that
extrasensory perception does not exist? Explain.


Significance, Importance, and Undetected Differences


Thought Questions 


1. Which do you think is more informative when you are given the results of a study: a
confidence interval or a p-value? Explain.


2. Suppose you were to read that a new study based on 100 men had found that there
was no difference in heart attack rates for men who exercised regularly and men who
did not. What would you suspect was the reason for that finding? Do you think the
study found exactly the same rate of heart attacks for the two groups of men?


3. An example in Chapter 23 used the results of a public opinion poll to conclude that
a majority of Americans did not think Bill Clinton had the honesty and integrity they
expected in a president. Would it be fair reporting to claim that “significantly fewer
than 50% of American adults in 1994 thought Bill Clinton had the honesty and integrity
they expected in a president”? Explain.


4. When reporting the results of a study, explain why a distinction should be made between
“statistical significance” and “significance,” as the term is used in ordinary
language.


5. Remember that a type 2 error is made when a study fails to find a relationship or difference
when one actually exists in the population. Is this kind of error more likely to
occur in studies with large samples or with small samples? Use your answer to explain
why it is important to learn the size of a study that finds no relationship or
difference.


24.1 Real Importance versus Statistical Significance


By now, you should realize that a statistically significant relationship or difference
does not necessarily mean an important one. Further, a result that is “significant” in
the statistical meaning of the word may not be “significant” in the common meaning
of the word. Whether the results of a test are statistically significant or not, it is
helpful to examine a confidence interval so that you can determine the magnitude of
the effect. From the width of the confidence interval, you will also learn how much
uncertainty there was in the sample results. For instance, from a confidence interval
for a proportion, we can learn the “margin of error.”


EXAMPLE 1 Is the President That Bad?


In the previous chapter, we examined the results of a Newsweek poll that asked the 
question: “From everything you know about Bill Clinton, does he have the honesty and 
integrity you expect in a president?” (16 May 1994, p. 23). The poll surveyed 518 adults, 
and 233, or 45% of them, said yes. There were 238 no answers (46%), and the rest 
were not sure. Using a hypothesis test, we determined that the proportion of the pop- 
ulation who would answer yes to that question was statistically significantly less than 
half. From this result, would it be fair to report that “significantly less than 50% of all
American adults in 1994 thought Bill Clinton had the honesty and integrity they expected
in a president”?


What the Word Significant Implies The use of the word significant in that context 
implies that the proportion who felt that way was much less than 50%. However, using 
our computations from Chapter 23 and methods from Chapter 20, let's compute a 95% 
confidence interval for the true proportion who felt that way: 


95% confidence interval = sample value ± 2(standard deviation)
= 0.45 ± 2(0.022) = 0.45 ± 0.044 = 0.406 to 0.494


Therefore, it could be that the true proportion is as high as 49.4%! Although that value 
is less than 50%, it is certainly not “significantly less” in the usual, nonstatistical mean- 
ing of the word. 
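
The interval above can be reproduced with a few lines of Python. This is only a sketch: the function name approx_95_ci is hypothetical, and it uses the "sample value plus or minus 2 standard deviations" rule of thumb from the text rather than the more exact multiplier of 1.96.

```python
import math

def approx_95_ci(p_hat, n):
    """Approximate 95% confidence interval for a proportion.

    Uses the rule of thumb: sample proportion +/- 2 standard deviations,
    where the standard deviation is sqrt(p_hat * (1 - p_hat) / n).
    """
    sd = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = 2 * sd
    return p_hat - margin, p_hat + margin, sd

# Newsweek poll: 45% of 518 adults answered yes.
low, high, sd = approx_95_ci(0.45, 518)
print(f"standard deviation = {sd:.3f}")
print(f"interval = {low:.3f} to {high:.3f}")
```

The width of the interval, not just its location relative to 0.50, is what tells you how much uncertainty the poll carries.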

In addition, if we were to repeat this exercise for the proportion who answered no 
to the question, we would reach the opposite conclusion. Of the 518 respondents, 238, 
or 46%, answered no when asked, “From everything you know about Bill Clinton, does 
he have the honesty and integrity you expect in a president?” If we construct the test 
as follows: 


Null hypothesis: The population proportion who would have answered no was 0.50. 


Alternative hypothesis: The population proportion who would have answered no 
was less than 0.50. 


the test statistic and p-value would be -1.82 and 0.034, respectively. Therefore, we
would also accept the hypothesis that less than a majority would answer no to the ques- 
tion. In other words, we have now found that less than a majority would have answered 
yes and less than a majority would have answered no to the question. 
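
As a check on the figures just quoted, here is a minimal sketch of the one-sided test for a proportion. The function name is hypothetical, and the sample proportions are entered in rounded form (0.46 and 0.45), which is an assumption made so that the results line up with the rounded values in the text.

```python
import math
from statistics import NormalDist

def one_sided_z_test_lower(p_hat, n, p0=0.5):
    """Test H0: p = p0 against Ha: p < p0 for a sample proportion.

    The standard deviation is computed under the null hypothesis,
    and the p-value is the area below z under the standard normal curve.
    """
    sd_null = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / sd_null
    p_value = NormalDist().cdf(z)
    return z, p_value

# Proportion answering no (rounded to 0.46) among 518 respondents.
z_no, p_no = one_sided_z_test_lower(0.46, 518)
print(f"no:  z = {z_no:.2f}, p-value = {p_no:.3f}")

# Proportion answering yes (rounded to 0.45), for comparison.
z_yes, p_yes = one_sided_z_test_lower(0.45, 518)
print(f"yes: z = {z_yes:.2f}, p-value = {p_yes:.3f}")
```

Both tests come out statistically significant at the 0.05 level, which is exactly the apparent paradox described above.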
The Importance of Learning the Exact Results The problem is that only 91% of the
respondents gave a definitive answer. The rest of them had no opinion. Therefore, it
would be misleading to focus only on the yes answers or only on the no answers without
also reporting the percentage who were undecided. This example illustrates again
the importance of learning exactly what was measured and what results were used in a
confidence interval or hypothesis test.


EXAMPLE 2 Is Aspirin Worth the Effort?


In Case Study 1.2, we examined the relationship between taking an aspirin every other 
day and incidence of heart attack. In testing the null hypothesis (there is no relationship 
between the two actions) versus the alternative (there is a relationship), the chi-square 
(test) statistic is over 25. Recall that a chi-square statistic over 3.84 would be enough to 
reject the null hypothesis. In fact, the p-value for the test is less than 0.00001. 


The Magnitude of the Effect These results leave little doubt that there is a strongly
statistically significant relationship between taking aspirin and incidence of heart attack.
However, the test statistic and p-value do not provide information about the magnitude 
of the effect. And remember, the p-value does not indicate the probability that there is 
a relationship between the two variables. It indicates the probability of observing a sam- 
ple with such a strong relationship, if there is no relationship between the two variables 
in the population. Therefore, it’s important to know the extent to which aspirin is related 
to heart attack outcome. 


Representing the Size of the Effect The data show that the rates of heart attack 
were 9.4 per 1000 for the group who took aspirin and 17.1 per 1000 for those who 
took the placebo. Thus, there is a difference of slightly less than 8 people per 1000, or 
about one less heart attack for every 125 individuals who took aspirin. Therefore, if all 
men who are similar to those in the study were to take an aspirin every other day, the 
results indicate that out of every 125 men, 1 less would have a heart attack than would 
otherwise have been the case. Another way to represent the size of the effect is to note 
that the aspirin group had just over half as many heart attacks as the placebo group,
indicating that aspirin could cut someone's risk almost in half. The original report gave
the relative risk as 0.53, with a 95% confidence interval extending from 0.42 to 0.67.

Whether that difference is large enough to convince a given individual to start tak- 
ing that much aspirin is a personal choice. However, being told only the fact that there 
is very strong statistical evidence for a relationship between taking aspirin and incidence 
of heart attack does not, by itself, provide the information needed to make that 
decision.
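
The two ways of expressing the size of the aspirin effect can be sketched in Python as follows. The variable names are hypothetical, and the inputs are the rounded rates quoted in the example (9.4 and 17.1 per 1000), so the outputs are approximations.

```python
# Heart attack rates quoted in the example, converted to proportions.
aspirin_rate = 9.4 / 1000
placebo_rate = 17.1 / 1000

# Absolute difference: how many fewer heart attacks per person treated.
risk_difference = placebo_rate - aspirin_rate

# Reciprocal of the difference: roughly how many men must take aspirin
# for one fewer heart attack to occur.
men_per_heart_attack_avoided = 1 / risk_difference

# Ratio of the two rates: the aspirin rate as a fraction of the placebo rate.
rate_ratio = aspirin_rate / placebo_rate

print(f"difference per 1000 = {risk_difference * 1000:.1f}")
print(f"men per heart attack avoided = {men_per_heart_attack_avoided:.0f}")
print(f"ratio of rates = {rate_ratio:.2f}")
```

Without rounding the difference up to 8 per 1000, the reciprocal comes out near 130 rather than 125, and the ratio of the rounded rates is close to, though not identical to, the relative risk of 0.53 given in the original report.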


24.2 The Role of Sample Size in Statistical Significance 


If the sample size is large enough, almost any null hypothesis can be rejected. This 
is because there is almost always a slight relationship between two variables, or a 
difference between two groups, and if you collect enough data, you will find it.


Table 24.1 Age at First Birth and Breast Cancer

                                 Developed Breast Cancer?
First child before age 25?       Yes       No      Total
Yes                               65     4475       4540
No                                31     1597       1628
Total                             96     6072       6168


Source: Carter et al., 1989, cited in Pagano and Gauvreau, 1993. 


EXAMPLE 3 How the Same Relative Risk Can Produce Different Conclusions 

Consider an example discussed in Chapters 12 and 13 determining the relationship be- 
tween breast cancer and age at which a woman had her first child. The results are 
shown in Table 24.1. The chi-square statistic for the results is 1.746 with p-value of 0.19. 
Therefore, we would not reject the null hypothesis. In other words, we have not found 
a statistically significant relationship between age at which a woman had her first child 
and the subsequent development of breast cancer. This is despite the fact that the rela- 
tive risk calculated from the data is 1.33. 

Now suppose a larger sample size had been used and the same pattern of results 
had been found. In fact, suppose three times as many women had been sampled, but 
the relative risk was still found to be 1.33. In that case, the chi-square statistic would 
also be increased by a factor of 3. Thus, we would find a chi-square statistic of 5.24 with 
p-value of 0.02! For the same pattern of results, we would now declare that there is a 
relationship between age at which a woman had her first child and the subsequent de- 
velopment of breast cancer. Yet, the reported relative risk would still be 1.33, just as in 
the earlier result.
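
The calculation in Example 3 can be sketched in Python. The helper names chi_square_2x2 and relative_risk are hypothetical, and the chi-square statistic is computed without a continuity correction, which is an assumption that happens to reproduce the values quoted in the example.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic and p-value for a 2x2 table [[a, b], [c, d]].

    No continuity correction is applied. With 1 degree of freedom the
    p-value equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def relative_risk(cases_1, total_1, cases_2, total_2):
    """Risk in group 1 divided by risk in group 2."""
    return (cases_1 / total_1) / (cases_2 / total_2)

# Table 24.1: rows are first child before age 25 (yes, no);
# columns are developed breast cancer (yes, no).
chi2, p = chi_square_2x2(65, 4475, 31, 1597)

# Same pattern of results with three times as many women.
chi2_x3, p_x3 = chi_square_2x2(3 * 65, 3 * 4475, 3 * 31, 3 * 1597)

# Risk for women whose first child came at 25 or later, relative to
# women whose first child came before 25. Tripling every count leaves it unchanged.
rr = relative_risk(31, 1628, 65, 4540)
rr_x3 = relative_risk(3 * 31, 3 * 1628, 3 * 65, 3 * 4540)

print(f"original: chi-square = {chi2:.3f}, p-value = {p:.2f}, relative risk = {rr:.2f}")
print(f"tripled:  chi-square = {chi2_x3:.2f}, p-value = {p_x3:.2f}, relative risk = {rr_x3:.2f}")
```

Multiplying every cell by the same factor multiplies the chi-square statistic by that factor while leaving the relative risk untouched, which is why the conclusion changes even though the pattern does not.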


24.3 No Difference versus No Statistically Significant Difference


As we have seen, whether the results of a study are statistically significant can de- 
pend on the sample size. In the previous section, we found that the results in Exam- 
ple 3 were not statistically significant. We showed, however, that a larger sample size 
with the same pattern of results would have yielded a statistically significant finding. 

There is a flip side to that problem. If the sample size is too small, an important
relationship or difference can go undetected. In that case, we would say that the
power of the test is too low. Remember that the power of a test is the probability of
making the correct decision when the alternative hypothesis is true. But the null hypothesis
is the status quo and is assumed to be true unless the sample values deviate
from it enough to convince us that chance alone cannot reasonably explain the deviation.
If we don’t collect enough data (even if the alternative hypothesis is true), we
may not have enough evidence to convincingly rule out the null hypothesis. In that
case, a relationship that really does exist in a population may go undetected in the
sample.


EXAMPLE 4 All That Aspirin Paid Off

The relationship between aspirin and incidence of heart attacks was discovered in a
long-term project called the Physicians’ Health Study (Physicians’ Health Study Research
Group, 1988). Fortunately, the study included a large number of participants (22,071);
otherwise the relationship may never have been found.

The chi-square statistic for the study was 25.01, highly statistically significant. But
suppose the study had only been based on 3000 participants, still a large number. Further,
suppose that approximately the same pattern of results had emerged, with about
9 heart attacks per 1000 in the aspirin condition and about 17 per 1000 in the placebo
condition. The result using the smaller sample would be as shown in Table 24.2. For this
table, we find the chi-square statistic is 3.65, with a p-value of 0.06. This result is not
statistically significant using the usual criterion of requiring a p-value of 0.05 or less.
Thus, we would have to conclude that there is not a statistically significant relationship
between taking aspirin and incidence of heart attacks.


Table 24.2 Aspirin, Placebo, and Heart Attacks

            Heart     No Heart              Percent           Rate
            Attack    Attack      Total     Heart Attacks     per 1000
Aspirin       14       1486       1500          0.93             9.3
Placebo       26       1474       1500          1.73            17.3
Total         40       2960       3000
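
The chi-square statistic for the hypothetical 3000-participant study can be verified with a short Python sketch. The function name chi_square_2x2 is hypothetical, no continuity correction is used (an assumption), and 3.84 is the cutoff for significance at the 0.05 level mentioned in Example 2.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic and p-value (1 degree of freedom, no
    continuity correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Table 24.2: rows are aspirin and placebo;
# columns are heart attack and no heart attack.
chi2_small, p_small = chi_square_2x2(14, 1486, 26, 1474)
significant = chi2_small > 3.84

print(f"chi-square = {chi2_small:.2f}, p-value = {p_small:.2f}")
print(f"statistically significant at 0.05? {significant}")
```

The rates in the table are nearly the same as in the full study, yet with 3000 participants the statistic falls just short of the cutoff, illustrating how an effect that is real can go undetected when the sample is too small.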


EXAMPLE 5 Important, But Not Significant, Differences in Salaries


A number of universities have tried to determine whether male and female faculty mem- 
bers with equivalent seniority earn equivalent salaries. A common method for deter- 
mining this is to use the salary and seniority data for men to find a regression equation 
to predict expected salary when seniority is known. The equation is then used to predict 
what each woman's salary should be, given her seniority; this is then compared with her 
actual salary. The differences between actual and predicted salaries are next averaged 
over all women faculty members to see if, on average, they are higher or lower than they 
would be if the equation based on the men’s salaries worked for them. 

Tomlinson-Keasey and colleagues (1994) used this method to study salary differ- 
ences between male and female faculty members at the University of California at Davis. 
They divided faculty into 11 separate groups, by subject matter, to make comparisons 
more useful. 

In each of the 11 groups, the researchers found that the women’s actual pay was 
lower than what would be predicted from the regression equation, and they concluded 
that the situation should be investigated further. However, for some of the subject mat- 
ter groups, the difference found was not statistically significant. For this reason, the re- 
searchers’ conclusion generated some criticism. 


Let's look at how large a difference would have had to exist for the study to be statistically
significant. We use the data from the humanities group as an example. There were
92 men and 51 women included in that analysis. The mean difference between men’s and
women's salaries, after accounting for seniority and years since Ph.D., was $3612.

If we were to assume that the data came from some larger population of faculty
members and test the null hypothesis that men and women were paid equally, then the
p-value for the test would be 0.08. Thus, a statistically naive reader might conclude that
no problem exists because the study found no statistically significant difference between
average salaries for men and for women, adjusted for seniority.

Because of the natural variability in salaries, even after adjusting for seniority, the
sample means would have to differ by over $4000 per year for samples of this size to
be able to declare the difference to be statistically significant.

The conclusion that there is not a statistically significant difference between men’s
and women’s salaries does not imply that there is not an important difference. It simply
means that the natural variability in salaries is so large that a very large difference in
means would be required to achieve statistical significance.

As one student suggested, the male faculty who complained about the study’s conclusions
because the differences were not statistically significant should donate the “nonsignificant”
amount of $3612 to help a student pay fees for the following year.


CASE STUDY 24.1 Seen a UFO? You May Be Healthier Than Your Friends


A survey of 5947 adult Americans taken in the summer of 1991 (Roper Organiza- 
tion, 1992) found that 431, or about 7%, reported having seen a UFO. Some authors 
have suggested that people who make such claims are probably psychologically dis- 
turbed and prone to fantasy. Nicholas Spanos and his colleagues (1993) at Carleton 
University in Ottawa, Canada, decided to test that theory. They recruited 49 volun- 
teers who claimed to have seen or encountered a UFO. They found the volunteers by 
placing a newspaper ad that read, “Carleton University researcher seeks adults who 
have seen UFOs. Confidential” (Spanos et al., 1993, p. 625). 

Eighteen of these volunteers, who reported seeing only “lights or shapes in the 
sky that they interpreted as UFOs” (p. 626), were placed in the “UFO nonintense” 
group. The remaining 31, who reported more complex experiences, were placed in 
the “UFO intense” group. 

For comparison, 74 students and 53 community members were included in the 
study. The community members were recruited in a manner similar to the UFO 
groups, except that the ad read “for personality study” in place of “who have seen 
UFOs.” The students received credit in a psychology course for participating. 

All subjects were given a series of tests and questionnaires. Attributes measured 
were “UFO beliefs, Esoteric beliefs, Psychological health, Intelligence, Temporal 
lobe lability [to see if epileptic type episodes could account for the experiences], 
Imaginal propensities [such as Fantasy proneness], and Hypnotizability” (p. 626). 

The New York Times reported the results of this work (Sullivan, 29 November 
1993) with the headline “Study finds no abnormality in those reporting UFOs.” The 
article described the results as follows: 


A study of 49 people who have reported encounters with unidentified flying 
objects, or UFOs, has found no tendency toward abnormality, apart from a
previous belief that such visitations from beyond the earth do occur. . . . The
tests [given to the participants] included standard psychological tests used to 
identify subjects with various mental disorders and assess their intelligence. 
The UFO group proved slightly more intelligent than the others. 

Source: Sullivan, “Study finds no abnormality in those reporting UFOs,” November 29, 
1993, New York Times, p. 37. 


Reading the Times report would leave the impression that there were no statistically 
significant differences found in the psychological health of the group. In fact, that is 
not the case. On many of the psychological measures, the UFO groups scored sta- 
tistically significantly better, meaning they were healthier than the student and com- 
munity groups. The null hypothesis for this study was that there are no population 
differences in mean psychological scores for those who have encountered UFOs 
compared with those who have not. But the alternative hypothesis of interest to 
Spanos and his colleagues was one-sided: the speculation that UFO observers are 
less healthy. The data, indicating that they might be healthier, were not consistent
with that alternative hypothesis, so the null hypothesis could not be rejected. Here is 
how Spanos and colleagues (1993) discussed the results in their report: 


The most important findings indicate that neither of the UFO groups scored 
lower on any measures of psychological health than either of the comparison 
groups. Moreover, both UFO groups attained higher psychological health 
scores than either one or both of the comparison groups on five of the psycho- 
logical health variables. In short, these findings provide no support whatso- 
ever for the hypothesis that UFO reporters are psychologically disturbed. 

(p. 628) 


In case you are curious, the UFO nonintense group scored statistically signifi- 
cantly higher than any of the others on the IQ test, and there were no significant dif- 
ferences in fantasy proneness, other paranormal experiences, or temporal lobe 
lability. The UFO groups scored better (healthier) on scales measuring self-esteem, 
schizophrenia, perceptual aberration, stress, well-being, and aggression.


24.4 A Summary of Warnings 


From this discussion, you should realize that you can’t simply rely on news reports 
to determine what to conclude from the results of studies. In particular, you should 
heed the following warnings: 


1. If the word significant is used to try to convince you that there is an important ef- 
fect or relationship, determine if the word is being used in the usual sense or in 
the statistical sense only. 


2. If a study is based on a very large sample size, relationships found to be statisti- 
cally significant may not have much practical importance. 

3. If you read that “no difference” or “no relationship” has been found in a study,
try to determine the sample size used. Unless the sample size was large, remember
that an important relationship may well exist in the population but that not
enough data were collected to detect it. In other words, the test could have had
very low power.


4. If possible, learn what confidence interval accompanies the hypothesis test, if
any. Even then you can be misled into concluding that there is no effect when
there really is, but at least you will have more information about the magnitude
of the possible difference or relationship.


5. Try to determine whether the test was one-sided or two-sided. If a test is one-sided,
as in Case Study 24.1, and details aren’t reported, you could be misled into
thinking there is no difference, when in fact there was one in the direction opposite
to that hypothesized.


6. Remember that the decision to do a one-sided test must be made before looking
at the data, based on the research question. Using the same data to both generate
and test the hypotheses is cheating. A one-sided test done that way will have a
p-value smaller than it should, making it easier to reject the null hypothesis.


7. Sometimes researchers perform a multitude of tests, and the reports focus on
those that achieved statistical significance. If all of the null hypotheses tested are
true, then 1 in 20 tests should achieve statistical significance just by chance. Beware
of reports where it is evident that many tests were conducted, but where results
of only one or two are presented as “significant.”


CASE STUDY 24.2 Finding Loneliness on the Internet


It was big news. Researchers at Carnegie Mellon University (Kraut et al., 1998) 
found that “greater use of the Internet was associated with declines in participants’ 
communication with family members in the household, declines in size of their so- 
cial circle, and increases in their depression and loneliness” (p. 1017). An article in 
the New York Times reporting on this study was entitled, “Sad, lonely world discov- 
ered in cyberspace” (Harmon, 30 August 1998). The study included 169 individuals 
in 73 households in Pittsburgh, Pennsylvania, who were given free computers and 
Internet service in 1995. The participants answered a series of questions at the beginning
of the study and again either 1 or 2 years later, measuring social contacts,
stress, loneliness, and depression. The New York Times reported: 


In the first concentrated study of the social and psychological effects of Inter- 
net use at home, researchers at Carnegie Mellon University have found that 
people who spend even a few hours a week online have higher levels of depres- 
sion and loneliness than they would if they used the computer network less 
frequently . . . it raises troubling questions about the nature of “virtual” com- 
munication and the disembodied relationships that are often formed in cyber- 
space. 


Source: Harmon, “Sad, lonely world discovered in cyberspace,” August 30, 1998, New 
York Times, p. A3. 


Given these dire reports, one would think that using the Internet for a few hours a 
week is devastating to your mental health. But a closer look at the findings reveals 
that the changes were actually quite small, although statistically significant.

Internet use averaged 2.43 hours per week for participants. The statistical analysis
used in the study was more complicated than that in this book, but some simple
statistics illustrate the magnitude of the results. The number of people in the participants’
“local social network” decreased from an average of 23.94 people to an average
of 22.90 people, hardly a noticeable loss. On a scale from 1 to 5, self-reported
loneliness decreased from an average of 1.99 to 1.89; lower scores indicate greater 
loneliness. And on a scale from 0 to 3, self-reported depression dropped from an av- 
erage of 0.73 to an average of 0.62; lower scores indicate higher depression. Further, 
more measures were taken than those reported, but only those that were statistically 
significant received attention, a problem discussed in Warning 7. For instance, the 
study measured social support, defined as “the number of people with whom they 
can exchange social resources,” and reported that “although the association between 
Internet use and subsequent social support is negative, the effect does not approach 
statistical significance (p > 0.40)” (Kraut et al., 1998, p. 1023). The news report did
not mention that “social support” was measured but not found to be significant. 

The New York Times did report the magnitude of some of the changes near the 
end of the article, noting for instance that “one hour a week on the Internet was associated,
on average, with an increase of 0.03, or 1 percent on the depression scale.”
But the attention the research received masked the fact that the impact of Internet use 
on depression, loneliness, and social contact was actually quite small. 
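
The concern raised in Warning 7, that only the significant measures receive attention, can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that a study runs a number of independent tests at the 0.05 level and that every null hypothesis is true. The function names and the choice of 20 tests are hypothetical and are not taken from the Carnegie Mellon study.

```python
def expected_false_positives(num_tests, level=0.05):
    """Expected number of tests reaching significance by chance alone
    when every null hypothesis is true."""
    return num_tests * level

def prob_at_least_one_false_positive(num_tests, level=0.05):
    """Chance that at least one test reaches significance by chance alone,
    assuming the tests are independent and every null hypothesis is true."""
    return 1 - (1 - level) ** num_tests

# Hypothetical study with 20 separate tests.
expected = expected_false_positives(20)
at_least_one = prob_at_least_one_false_positive(20)

print(f"expected number significant by chance = {expected:.1f}")
print(f"probability of at least one = {at_least_one:.2f}")
```

Under these assumptions, a report that highlights one or two significant results out of many tests conducted is showing roughly what chance alone would be expected to produce.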


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


*1. An advertisement for Claritin, a drug for seasonal nasal allergies, made this 
claim: “Clear relief without drowsiness. In studies, the incidence of drowsiness 
was similar to placebo” (Time, 6 February 1995, p. 43). The advertisement also 
reported that 8% of the 1926 Claritin takers and 6% of the 2545 placebo takers 
reported drowsiness as a side effect. A one-sided test of whether a higher pro- 
portion of Claritin takers than placebo takers would experience drowsiness in 
the population results in a p-value of about 0.005. 


*a. Can you conclude that the incidence of drowsiness for the Claritin takers is 
statistically significantly higher than that for the placebo takers? 


*b. Does the answer to part a contradict the statement in the advertisement that 
the “incidence of drowsiness was similar to placebo”? Explain. 


c. Use this example to discuss the importance of making the distinction be- 
tween the common use and the statistical use of the word significant. 


2. Refer to Case Study 24.2, in which a report stated that Internet use was associ- 
ated with a statistically significant increase in depression. 


a. Would it have been more appropriate to use one-sided or two-sided hypoth- 
esis tests for that research? Explain. 


b. Explain what would have constituted a type 1 and type 2 error when testing 
whether Internet use is associated with greater loneliness. Which type of er- 
ror could have been committed in this study? 


3. An article in Time magazine (Gorman, 6 February 1995) reported that an advisory
panel recommended that the Food and Drug Administration (FDA) allow
an experimental AIDS vaccine to go forward for testing on 5000 volunteers. The
vaccine was developed by Jonas Salk, who developed the first effective polio
vaccine. The AIDS vaccine was designed to boost the immune system of HIV-infected
individuals and had already been tested on a small number of patients,
with mixed results but no apparent side effects.


a. In making its recommendation to the FDA, the advisory panel was faced
with a choice similar to that in hypothesis testing. The null hypothesis was
that the vaccine was not effective and therefore should not be tested further,
whereas the alternative hypothesis was that it might have some benefit. Explain
the consequences of a type 1 and a type 2 error for the decision the
panel was required to make.


b. The chairman of the panel, Dr. Stanley Lemon, was quoted as saying, “I’m
not all that excited about the data I’ve seen . . . [but] the only way the concept
is going to be laid to rest . . . is really to try [the vaccine] in a large population”
(p. 53). Explain why the vaccine should be tested on a larger group,
when it had not proven effective in the initial tests on a small group.


*4. In Example 2 of Chapter 13, we revisited data from Case Study 6.3, regarding
testing to see if there was a relationship between gender and driving after drinking.
We found that we could not rule out chance as an explanation for the sample
data; the chi-square statistic was 1.637 and the p-value was 0.20. Now
suppose that the sample contained three times as many drivers, but the proportions
of males and females who drank before driving were still 16% and 11.6%,
respectively.


*a. What would be the value of the chi-square statistic for this hypothetical
larger sample? (Hint: See Example 3 in this chapter.)


b. The p-value for the test based on the larger sample would be about 0.03. Restate
the hypotheses being tested and explain which one you would choose
on the basis of this larger hypothetical sample.


c. In both the original test and the hypothetical one based on the larger sample,
the probability of making a type 1 error if there really is no relationship is
5%. Assuming there is a relationship in the population, would the power of
the test (that is, the probability of correctly finding the relationship) be
higher, lower, or the same for the sample three times larger compared with
the original sample size?


5. New Scientist (Mestel, 12 November 1994) reported a study in which psychiatrist
Donald Black used the drug fluvoxamine to treat compulsive shoppers:


In Black’s study, patients take the drug for eight weeks, and the effect on 
their shopping urges is monitored. Then the patients are taken off the drug 
and watched for another month. In the seven patients examined so far, the 
results are clear and dramatic, says Black: the urge to shop and the time 
spent shopping decrease markedly. When the patient stops taking the drug, 
however, the symptoms slowly return. (p. 7)


a. Explain why it would have been preferable to have a double-blind study, in
which shoppers were randomly assigned to take fluvoxamine or a placebo.


b. What are the null and alternative hypotheses for this research?


c. Can you make a conclusion about the hypotheses in part b on the basis of this
report? Explain.


6. In reporting the results of a study to compare two population means, explain
why researchers should report each of the following:


a. A confidence interval for the difference in the means.


b. A p-value for the results of the test, as well as whether it was one- or
two-sided.


c. The sample sizes used.


d. The number of separate tests they conducted during the course of the research.


7. Explain why it is important to learn what sample size was used in a study for
which “no difference” was found.


*8. The top story in USA Today on December 2, 1993, reported that “two research
teams, one at Harvard and one in Germany, found that the risk of a heart attack
during heavy physical exertion . . . was two to six times greater than during less
strenuous activity or inactivity” (Snider, 2 December 1993, p. 1A). It was
implicit, but not stated, that these results were for those who do not exercise
regularly.


*a. There is a confidence interval reported in this statement. Explain what it is
and what characteristic of the population it is measuring.


b. The report also stated that “frequent exercisers had slight or no increased
risk.” (In other words, frequent exercisers had slight or no increased risk of
heart attack when they took on a strenuous activity compared with when they
didn’t.) Do you think that means the relative risk was actually 1.0? If not,
discuss what the statement really does mean in the context of what you have
learned in this chapter.


9. We have learned that the probability of making a type 1 error when the null hypothesis
is true is usually set at 5%. The probability of making a type 2 error
when the alternative hypothesis is true is harder to find. Do you think that probability
depends on the size of the sample? Explain your answer.


*10. Refer to Case Study 22.1, concerning the ganzfeld procedure for testing ESP. In
earlier studies using the ganzfeld procedure, the results were mixed in terms of 
whether they were statistically significant. In other words, some of the experi- 
ments were statistically significant and others were not. Critics used this as ev- 
idence that there was really nothing going on, even though more than 5% of the 
experiments were successful. Typical sample sizes were from 10 to 100 partici- 
pants. Give an explanation for why the results would have been mixed. 

When the Steering Committee of the Physicians’ Health Study Research Group 
(1988) reported the results of the effect of aspirin on heart attacks, committee 
members also reported the results of the same aspirin consumption, for the same 
sample, on strokes. There were 80 strokes in the aspirin group and only 70 in 
the placebo group. The relative risk was 1.15, with a 95% confidence interval 
ranging from 0.84 to 1.58. 


a. What value for relative risk would indicate that there is no relationship 
between taking aspirin and having a stroke? Is that value contained in the 
confidence interval? 


b. Set up the appropriate null and alternative hypotheses for this part of the 
study. The original report gave a p-value of 0.41 for this test. What 
conclusion would be made on the basis of that value? 


c. Compare your results from parts a and b. Explain how they are related. 


d. There was actually a higher percentage of strokes in the group that took 
aspirin than in the group that took a placebo. Why do you think this result did 
not get much press coverage, whereas the result indicating that aspirin 
reduces the risk of heart attacks did get substantial coverage? 


*12. The authors of the report in Case Study 24.1, comparing the psychological 
health of UFO observers and nonobservers, presented a table in which they 
compared the four groups of volunteers on each of 20 psychological measures. 
For each of the measures, the null hypothesis was that there were no differences 
in population means for the various types of people on that measure. If there 
truly were no population differences, and if all of the measures were 
independent of each other, for how many of the 20 measures would you expect the null 
hypothesis to be rejected, using the 0.05 criterion for rejection? Explain how 
you got your answer. 


13. In the study by Lee Salk, reported in Case Study 1.1, he found that infants who 
listened to the sound of a heartbeat in the first few days of life gained more 
weight than those who did not. In searching for potential explanations, Salk 
wrote the following. Discuss Salk’s conclusion and the evidence he provided for 
it. Do you think he effectively ruled out food intake as being important? 


In terms of actual weight gain the heartbeat group showed a median gain 
of 40 grams; the control group showed a median loss of 20 grams. There 
was no significant difference in food intake between the two groups. 

... There was crying 38 percent of the time in the heartbeat group of 
infants; in the control group one infant or more cried 60 percent of the 
time. . . . Since there was no difference in food intake between the two 
groups, it is likely that the weight gain for the heartbeat group was due to 
a decrease in crying. (Salk, May 1973, p. 29) 


14. An advertisement for NordicTrack, an exercise machine that simulates 
cross-country skiing, claimed that “in just 12 weeks, research shows that people who 
used a NordicTrack lost an average of 18 pounds.” Forgetting about the 
questions surrounding how such a study might have been conducted, what additional 
information would you want to know about the results before you could come 
to a reasonable conclusion? 


15. Explain why it is not wise to accept a null hypothesis. 


16. Now that you understand the reasoning behind making inferences about 
populations based on samples (confidence intervals and hypothesis tests), explain 
why these methods require the use of random, or at least representative, samples 
instead of convenience samples. 


*17. Would it be easier to reject hypotheses about populations that had a lot of nat- 
ural variability in the measurements or a little variability in the measurements? 
Explain. 


Exercises 18 to 22 refer to News Story and Original Source 1, “Alterations in 
Brain and Immune Function Produced by Mindfulness Meditation” on the CD. For 
this study, volunteers were randomly assigned to a meditation group (25 partici- 
pants) or a control group (16 participants). The meditation group received an 
8-week meditation training program. Both groups were given influenza shots and 
their antibodies were measured. This measurement was taken after the meditation 
group had been practicing meditation for about 8 weeks. The researchers also 
measured brain activity, but these exercises will not explore that part of the study. 


18. One of the quotes in News Story 1 refers to the results of measuring antibodies 
to the influenza vaccine. It says: 


While everyone who participated in the study had an increased number of 
antibodies, the meditators had a significantly greater increase than the 
control group. “The changes were subtle, but statistically it was 
significant,” says [one of the researchers]. (Kotansky, 2003, p. 35) 


a. Do you think it would have been fair reporting if only the first sentence of 
the quote had been given, without the second sentence? Explain, including 
what additional information you learned from the second sentence. 


b. Based on this quote, what null hypothesis do you think was being tested? 
Make sure you use correct terminology regarding populations and samples. 


19. Refer to the previous exercise. 


a. Do you think the alternative hypothesis used by the researchers was 
one-sided or two-sided? Explain. 


b. Do you think the researchers would be justified in specifying a one-sided 
alternative hypothesis in this situation? Explain why or why not. 


Text not available due to copyright restrictions 


22. Refer to items 1 and 2 in Section 24.4, “A Summary of Warnings.” Explain 
whether or not they are likely to apply to results from this study. 


Exercises 23 and 24 refer to News Story 19 and Additional News Story 19 on the CD. 
The following quote is from Additional News Story 19: 


They found that adolescents who were romantically involved during the year 
experienced a significantly larger increase in symptoms of depression than 
adolescents who were not romantically involved. They also found that depres- 
sion levels of romantically involved girls increased more sharply than that of 
romantically involved boys, especially among younger adolescents. (Lang, 2001) 

23. Read News Story 19, “Young romance may lead to depression, study says” and 
use it to answer these questions. 
a. How many participants were there in this study? 


b. The story actually reports the magnitude of the difference in depression lev- 
els for romantically involved and uninvolved teens. What was it? 


c. Given your answers to parts a and b, which two of the seven warnings in Sec- 
tion 24.4 apply to the preceding quote from Additional News Story 19? 

d. Do you think the word significantly in the quote is being used in the statisti- 
cal or the English sense of the word? 


24. Read News Story 19 and Additional News Story 19. Based on the information 
contained in them, rewrite and expand the information in the preceding quote so 
that it would not be misleading to someone with no training in statistics. 


Mini-Projects 


1. Find a newspaper article that you think presents the results of a hypothesis test 
in a misleading way. Explain why you think it is misleading. Rewrite the ap- 
propriate part of the article in a way that you consider to be not misleading. If 
necessary, find the original journal report of the study so you can determine 
what details are missing from the newspaper account. 


2. Find two journal articles, one that reports on a statistically significant result and 
one that reports on a nonsignificant result. Discuss the role of the sample size in 
the determination of statistical significance, or lack thereof, in each case. Dis- 
cuss whether you think the researchers would have reached the same conclusion 
if they had used a smaller or larger sample size. 


Meta-Analysis: 
Resolving Inconsistencies 
across Studies 


Thought Questions 


1. Suppose a new study involving 14,000 participants has found a relationship between 
a particular food and a certain type of cancer. The report on the study notes that 
“past studies of this relationship have been inconsistent. Some of them have found 
a relationship, but others have not.” What might be the explanation for the 
inconsistent results from the different studies? 


2. Suppose 10 similar studies, all on the same kind of population, have been conducted 
to determine the relative risk of heart attack for those who take aspirin and those 
who don't. To get an overall picture of the relative risk we could compute a separate 
confidence interval for each study or combine all the data to create one confidence 
interval. Which method do you think would be preferable? Explain. 


3. Suppose two studies have been done to compare surgery versus relaxation for 
sufferers of chronic back pain. One study was done at a back specialty clinic and the 
other at a suburban medical center. The result of interest in each case was the 
relative risk of further back problems following surgery versus following relaxation 
training. To get an overall picture of the relative risk, we could compute a separate 
confidence interval for each study or combine the data to create one confidence 
interval. Which method do you think would be preferable? Explain. 


4. Refer to Thought Questions 2 and 3. If two or more studies have been done to 
measure the same relative risk, give one reason why it would be better to combine the 
results and one reason why it would be better to look at the results separately for 
each study. 


25.1 The Need for Meta-Analysis 


The obvious questions were answered long ago. Most research today tends to focus 
on relationships or differences that are not readily apparent. For example, plants that 
are obviously poisonous have been absent from our diets for centuries. But we may 
still consume plants (or animals) that contribute to cancer or other diseases over the 
long term. The only way to discover these kinds of relationships is through statisti- 
cal studies. 

Because most of the relationships we now study are moderate in size, researchers 
often fail to find a statistically significant result. As we learned in the last chapter, 
the number of participants in a study is a crucial factor in determining whether the 
study finds a “significant” relationship or difference, and many studies are simply 
too small to do so. As a consequence, reports are often published that appear to con- 
flict with earlier results, confusing the public and researchers alike. 


The Vote-Counting Method 


One way to address this problem is to gather all the studies that have been done on 
a topic and try to assimilate the results. Until recently, if researchers wanted to ex- 
amine the accumulated body of evidence for a particular relationship, they would 
find all studies conducted on the topic and simply count how many of these had 
found a statistically significant result. They would often discount entirely all studies 
that had not and then attempt to explain any remaining differences in study results 
by subjective assessments. 

As you should realize from Chapter 24, this vote-counting method is seriously 
flawed unless the number of participants in each study is taken into account. For ex- 
ample, if 10 studies of the effect of aspirin on heart attacks had each been conducted 
on 2200 men, rather than one study on 22,000 men, none of the 10 studies would be 
likely to show a relationship. In contrast, as we saw in Chapter 13, the one study on 
22,000 men showed a very significant relationship. In other words, the same data, 
had it been collected by 10 separate researchers, easily could have resulted in exactly 
the opposite conclusion to what was found in the one large study. 
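
The sample-size point can be checked with a rough power calculation. The sketch below is illustrative only: it uses a normal approximation for a two-sided test comparing two proportions, splits each study evenly between aspirin and placebo, and assumes heart attack rates of 1.7% (placebo) and 0.9% (aspirin). Those rates, the even split, and the function name `power_two_proportions` are assumptions made for this example, not figures taken from this chapter.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treated, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two sample proportions.
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_treated * (1 - p_treated) / n_per_group)
    # True difference expressed in standard-error units.
    shift = abs(p_control - p_treated) / se
    # Chance that the test statistic falls beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Assumed, illustrative heart attack rates (hypothetical values).
P_PLACEBO = 0.017
P_ASPIRIN = 0.009

power_small = power_two_proportions(P_PLACEBO, P_ASPIRIN, n_per_group=1100)
power_large = power_two_proportions(P_PLACEBO, P_ASPIRIN, n_per_group=11000)

print(f"Power with 2,200 men:  {power_small:.2f}")
print(f"Power with 22,000 men: {power_large:.2f}")
```

Under these assumed rates, a study of 2,200 men detects the difference well under half the time, whereas a study of 22,000 men detects it almost always. A tally of "significant" results across ten small studies would therefore probably look mixed even though the effect is real.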


What Is Meta-Analysis? 


Since the early 1980s researchers have become increasingly aware of the problems 
with traditional research synthesis methods. They are now more likely to conduct a 
meta-analysis of the studies. Meta-analysis is a collection of statistical techniques 
for combining studies. These techniques focus on the magnitude of the effect in each 
study, rather than on either a vote count or a subjective evaluation of the available 
evidence. 

Quantitative methods for combining results have been available since the early 
1900s, but it wasn’t until 1976 that the name “meta-analysis” was coined. The sem- 
inal paper was called, “Primary, secondary and meta-analysis of research.” In that 
paper, Gene Glass (1976) showed how these methods could be used to synthesize the 
results of studies comparing treatments in psychotherapy. It was an idea whose time 
had come. Researchers and methodologists set about to create new techniques for 
combining and comparing studies, and thousands of meta-analyses were undertaken. 


What Meta-Analysis Can Do 


Meta-analysis has made it possible to find definitive answers to questions about 
small and moderate relationships by taking into account data from a number of dif- 
ferent studies. It is not without its critics, however, and as with most statistical meth- 
ods, difficulties and disasters can befall the naive user. In the remainder of this 
chapter, we examine some decisions that can affect the results of a meta-analysis, 
discuss some benefits and criticisms of meta-analysis, and look at two case studies. 


25.2 Two Important Decisions for the Analyst 


You have already learned how to do a proper survey, experiment, or observational 
study. Conducting a meta-analysis is an entirely different enterprise because it does 
not involve using new participants at all. It is basically a study of the studies that 
have already been done on a particular topic. 

A variety of decisions must be made when conducting a meta-analysis, many of 
which are beyond the scope of our discussion. However, there are two important 
decisions of which you should be informed when you read the results of a meta- 
analysis. The answers will help you determine whether to believe the results. 


When reading the results of meta-analysis, you should know 


1. Which studies were included 
2. Whether the results were compared or combined 


Question 1: Which Studies Should Be Included? 
Types of Studies 


As you learned early in this book, studies are sometimes conducted by special- 
interest groups. Some studies are conducted by students to satisfy requirements for 
a Ph.D. degree but then never published in a scholarly journal. Sometimes studies 
are reported at conferences and published in the proceedings. These studies are not 
carefully reviewed and criticized in the same way they would be for publication in a 
scholarly journal. Thus, one decision that must be made before conducting a meta- 
analysis is whether to include all studies on a particular topic or only those that have 
been published in properly reviewed journals. 


Timing of the Studies 

Another consideration is the timing of the studies. For example, in Case Study 25.2, 
we review a recent meta-analysis that attempted to answer the question of whether 
mammograms should be used for the detection of breast cancer in women in their 
40s. Should the analysis have included studies started many years ago, when neither 
the equipment nor the technicians may have been as sophisticated as they are today? 


Quality Control 

Sometimes studies are investigated for quality before they are included. For instance, 
Eisenberg and colleagues (15 June 1993) conducted a meta-analysis on whether be- 
havioral techniques such as biofeedback and relaxation were effective in lowering 
blood pressure in people whose blood pressure was too high. They identified a total 
of 857 articles for possible inclusion but used only 26 of them in the meta-analysis. 
In order to be included, the studies had to meet stringent criteria, including the use 
of one or more control conditions, random assignment into the experimental and 
control groups, detailed descriptions of the intervention techniques for treatment and 
control, and so on. They also excluded studies that used children or that used only 
pregnant women. 

Thus, they decided to exclude studies that had some of the problems you learned 
about earlier in this book, such as nonrandomized treatments or no control group. 
Using the remaining 26 studies, they found that both the behavioral techniques and 
the placebo techniques (such as “sham meditation”) produced effects but that there 
was not a significant difference between the two. Their conclusion was that “cogni- 
tive interventions for essential hypertension are superior to no therapy but not supe- 
rior to credible sham techniques or self-monitoring” (p. 964). If they had included a 
multitude of studies that did not use placebo techniques, they may not have realized 
that those techniques could be as effective as the real thing. 


Accounting for Differences in Quality 

Some researchers believe all possible studies should be included in the meta- 
analysis. Otherwise, they say, the analyst could be accused of including studies that 
support the desired outcome and finding excuses to exclude those that don’t. One so- 
lution to this problem is to include all studies but account for differences in quality 
in the process of the analysis. A meta-analysis of programs designed to prevent ado- 
lescents from smoking used a compromise approach: 


Evaluations of 94 separate interventions were included in the meta-analysis. 
Studies were screened for methodological rigor and those with weaker method- 
ology were segregated from those with more defensible methodology; major 
analyses focused on the latter. (Bruvold, 1993, p. 872) 


Assessing Quality 

Chalmers and his co-authors (1981) constructed a generic set of criteria for deciding 
which papers to include in a meta-analysis and for assessing the quality of studies. 
Their basic idea is to have two researchers, who are blind as to who conducted the 
studies and what the results were, independently make decisions about the quality of 
each study. In other words, there should be two independent assessments about 
whether to include each study, and those assessments should be made without know- 
ing how the study came out. This technique should eliminate inclusion biases based 
on whether the results were in a desired direction. 

When you read the results of a meta-analysis, you should be told how the authors 
decided which studies to include. If there was no attempt to discount flawed studies 
in some way, then you should realize that the combined results may also be flawed. 


Question 2: Should Results Be Compared or Combined? 


In Chapters 11 and 12, you learned about the perils of combining data from separate 
studies. Recall that Simpson’s Paradox can occur when you combine the results of 
two separate contingency tables into one. The problem occurs when a relationship is 
different for one population than it is for another but the results for samples from the 
two are combined. 


Meta-Analysis and Simpson’s Paradox 

Meta-analysis is particularly prone to the problem of Simpson’s Paradox. For exam- 
ple, consider two studies comparing surgery to relaxation as treatments for chronic 
back pain. Suppose one is conducted at a back-care specialty clinic, whereas the 
other is conducted at a suburban medical center. It is likely that the patients with the 
most severe problems will seek out the specialty clinic. Therefore, relaxation train- 
ing may be sufficient and preferable for the suburban medical center, but surgery 
may be preferable for the specialty clinic. 
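
A small numerical sketch shows how pooling the two sites could mislead. All counts below are hypothetical, invented only to match the scenario described above: severe cases concentrate at the specialty clinic, where most patients get surgery, while the suburban center mostly uses relaxation training. The helper names `relative_risk` and `site_rr` are likewise just for this illustration.

```python
def relative_risk(events_a, n_a, events_b, n_b):
    """Risk in group A divided by risk in group B."""
    return (events_a / n_a) / (events_b / n_b)


# Hypothetical counts: (patients with further back problems, patients treated)
specialty = {"surgery": (80, 200), "relaxation": (30, 50)}
suburban = {"surgery": (10, 50), "relaxation": (30, 200)}


def site_rr(site):
    """Relative risk of further back problems, surgery versus relaxation."""
    return relative_risk(*site["surgery"], *site["relaxation"])


# Naive pooling: add the counts from both sites as if they were one study.
pooled = {
    treatment: (specialty[treatment][0] + suburban[treatment][0],
                specialty[treatment][1] + suburban[treatment][1])
    for treatment in ("surgery", "relaxation")
}

rr_specialty = site_rr(specialty)
rr_suburban = site_rr(suburban)
rr_pooled = site_rr(pooled)

print(f"Specialty clinic: {rr_specialty:.2f}")  # below 1: surgery looks better
print(f"Suburban center:  {rr_suburban:.2f}")   # above 1: relaxation looks better
print(f"Pooled counts:    {rr_pooled:.2f}")
```

With these made-up counts the pooled relative risk is larger than the relative risk at either site, so the combined figure describes neither population well. That is the kind of distortion to watch for when dissimilar studies are merged.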


Populations Must Be the Same and Methods Similar 

A meta-analysis should never attempt to combine the data from all studies into one 
hypothesis test or confidence interval unless it is clear that the same populations 
were sampled and similar methods were used. Otherwise, the analysis should first 
attempt to see if there are readily explainable differences among the results. 

For example, in Case Study 25.1, we will investigate a meta-analysis of the ef- 
fects of cigarette smoking on sperm density. Some of the studies included used in- 
fertility clinics as the source of participants; others used the general population. The 
researchers discovered that the relationship between smoking and sperm density was 
more variable across the studies for which the participants were from infertility clin- 
ics. They speculated that “it is possible that smoking has a greater effect on normal 
men since infertility clinic populations may have other reasons for the lowered 
sperm density” (Vine et al., January 1994, p. 41). Because of the difference, they re- 
ported results separately for the two sources of participants. If they had not, the re- 
lationship between smoking and sperm density would have been underestimated for 
the general population. 

When you read the results of a meta-analysis, you need to ascertain whether 
something like Simpson’s Paradox could have been a problem. If dozens of studies 
have been combined into one result, without any mention of whether the potential 
for this problem was investigated, you should be suspicious. 


CASE STUDY 25.1 


Smoking and Reduced Fertility 


Vine and colleagues (January 1994) identified 20 studies, published between Janu- 
ary 1966 and October 1, 1992, that examined whether men who smoked cigarettes 
had lower sperm density than men who did not smoke. They found that most of the 
studies reported reduced sperm density for men who smoked, but the difference was 
statistically significant for only a few of the studies. Thus, in the traditional scien- 
tific review, the accumulated results would seem to be inconsistent and inconclusive. 

Studies were excluded from the meta-analysis for only two reasons. One was if 
the study was a subset of a larger study that was already included. The other was if 
it was obvious that the smokers and nonsmokers differed with respect to fertility sta- 
tus. One study was excluded for that reason. The nonsmokers all had children, 
whereas the smokers were attending a fertility clinic. 

A variety of factors differed among the studies. For example, only 10 of the stud- 
ies reported that the person measuring the sperm density was blind to the smoking 
status of the participants. In 6 of the studies, a “smoker” was defined as someone 
who smoked at least 10 cigarettes per day. Thirteen of the studies used infertility 
clinics as a source of participants; 7 did not. All of these factors were checked to see 
if they resulted in a difference in outcome. 

None of these factors resulted in a difference. However, the authors were suspi- 
cious of 2 studies conducted on infertility clinic patients because those 2 studies 
showed a much larger relationship between smoking and sperm density than the re- 
maining 11 studies on that same population. When those 2 studies were omitted, the 
remaining studies using infertility clinic patients showed a smaller average effect 
than the studies using normal men. The authors conducted an analysis using all of 
the data, as well as separate analyses on the two types of populations: those attend- 
ing infertility clinics and those not attending. They omitted the 2 studies with larger 
effects when studying only the infertility clinics. 

The authors found that there was indeed lower average sperm density for men 
who smoked. Using all of the data combined, but giving more weight to studies with 
larger sample sizes, they found that the reduction in sperm density for smokers com- 
pared with nonsmokers was 12.6%, with a 95% confidence interval extending from 
8.0% to 17.1%. A test of the null hypothesis that the reduction in sperm density for 
smokers in the population is actually zero resulted in a p-value less than .0001. The 
estimate of reduction in sperm density for the normal (not infertile) men only was 
even higher, at 23.3%; a confidence interval was not provided. 
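
To see why weighting by sample size can change the combined estimate, consider the sketch below. The four (sample size, percent reduction) pairs are hypothetical and are not the studies analyzed by Vine and colleagues; their exact weighting scheme is not described here, so simple weighting in proportion to sample size is assumed.

```python
# Hypothetical studies: (sample size, estimated percent reduction in sperm density)
studies = [(40, 24.0), (60, 18.0), (150, 14.0), (500, 10.0)]


def equal_weight_mean(studies):
    """Average the study estimates, counting every study the same."""
    return sum(effect for _, effect in studies) / len(studies)


def size_weighted_mean(studies):
    """Average the study estimates, weighting each by its sample size."""
    total_n = sum(n for n, _ in studies)
    return sum(n * effect for n, effect in studies) / total_n


equal = equal_weight_mean(studies)
weighted = size_weighted_mean(studies)

print(f"Equal weights:        {equal:.1f}%")
print(f"Sample-size weights:  {weighted:.1f}%")
```

In this made-up example the small studies happen to report the largest effects, so the equal-weight average comes out higher than the sample-size-weighted one. The weighted figure leans toward the large studies, whose estimates are less affected by chance variation.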

The authors also conducted their own study of this question using 88 volunteers 
recruited through the newspaper. In summarizing the findings of past reviews, their 
meta-analysis, and their own study of 88 men, Vine and colleagues (January 1994) 
illustrate the importance of meta-analysis: 


The results of this meta-analysis indicate that smokers’ sperm density is on av- 
erage 13% [when studies are weighted by sample size] to 17% [when studies 
are given equal weight] lower than that of nonsmokers. . . . The reason for the 
inconsistencies in published findings with regard to the association between 
smoking and sperm density appears to be the result of random error and small 
sample sizes in most studies. Consequently, the power is low and the chance of 
a false negative finding is high. Results of the authors’ study support these 
findings. The authors noted a 22.8% lower sperm density among smokers 
compared with nonsmokers. However, because of the inherent variability in sperm 
density among individuals, the study sample size (n = 88) was insufficient to 
produce statistically significant results. (p. 40) 


As with individual studies, results from a meta-analysis like this one should not 
be interpreted to imply a causal relationship. As the authors note, there are potential 
confounding factors. They mention studies that have shown smokers are more likely 
to consume drugs, alcohol, and caffeine and are more likely to experience sex early 
and to be divorced. Those are all factors that could contribute to reductions in sperm 
density, according to the authors. 


25.3 Some Benefits of Meta-Analysis 


We have just seen one of the major benefits of meta-analysis. When a relationship is 
small or moderate, it is difficult to detect with small studies. A meta-analysis allows 
researchers to rule out chance for the combined results, when they could not do so 
for the individual studies. It also allows researchers to get a much more precise 
estimate of a relationship (that is, a narrower confidence interval) than they could get 
with individual small studies. 


Some benefits of meta-analysis are 


1. Detecting small or moderate relationships 
2. Obtaining a more precise estimate of a relationship 
3. Determining future research 
4. Finding patterns across studies 


Determining Future Research 


Summarizing what has been learned so far in a research area can lead to insights 
about what to ask next. In addition, identifying and enumerating flaws from past 
studies can help illustrate how to better conduct future studies. 


EXAMPLE 1 Designing Better Experiments 


In Case Study 22.1, we examined the ganzfeld experiments used to study extrasensory 
perception. The experiments in that case study were conducted using an automated pro- 
cedure that was developed in response to an earlier meta-analysis. The earlier meta- 
analysis had found highly significant differences when compared with chance 
(Honorton, 1985). However, a critic who was involved with the analysis insisted that the 
results were, in fact, due to flaws in the procedure (Hyman, 1985). He identified flaws 
such as improper randomization of target pictures and potentially sloppy data recording. 
He also identified more subtle flaws. For example, some of the early studies used 
photographs that a sender was supposed to transmit mentally to a receiver. If the sender 
was holding a photograph of the true target and the receiver was later asked to pick 
which of four pictures had been the true target, the receiver could have seen the 
sender's fingerprints, and these, and not some psychic information, could have 
provided the answer. The new experiments, reported in Case Study 22.1, were designed to 
be free of all the flaws that had been identified in the first meta-analysis. 


EXAMPLE 2 Changes in Focus of the Research 


Another example is provided by a meta-analysis on the impact of sexual abuse on chil- 
dren (Kendall-Tackett et al., 1993). In this case, the authors did not suggest corrections 
of flaws but rather changes in focus for future research. For example, they noted that 
many studies of the effects of sexual abuse combine children across all age groups into 
one search for symptoms. In their analysis across studies, they were able to look at dif- 
ferent age groups to see if the consequences of abuse differed. They found: 


For preschoolers, the most common symptoms were anxiety, nightmares, general 
PTSD [post-traumatic stress disorder], internalizing, externalizing, and inappropriate 
sexual behavior. For school-age children, the most common symptoms included 
fear, neurotic and general mental illness, aggression, nightmares, school problems, 
hyperactivity, and regressive behavior. For adolescents, the most common behav- 
iors were depression; withdrawn, suicidal, or self-injurious behaviors; somatic com- 
plaints; illegal acts; running away; and substance abuse. (p. 167) 


In essence, the authors are warning researchers that they could encounter Simpson's 
Paradox if they do not recognize the differences among various age groups. They pro- 
vide advice for anyone conducting future studies in this domain: 


Many researchers have studied subjects from very broad age ranges (e.g., 3-18 
years) and grouped boys and girls together . . . this grouping together of all ages 
can mask particular developmental patterns of the occurrence of some symptoms. 
At a minimum, future researchers should divide children into preschool, school, 
and adolescent age ranges when reporting the percentages of victims with
symptoms. (p. 176)
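
To see how pooling all ages can hide a pattern, consider a minimal sketch in Python. The counts below are invented for illustration and are not taken from Kendall-Tackett et al.; they describe a symptom that is common in one age group and rare in the others, and show how it looks once the groups are combined.

```python
# Hypothetical counts, invented for illustration only:
# (number of children with the symptom, number of children studied)
groups = {
    "preschool": (40, 100),
    "school-age": (10, 100),
    "adolescent": (5, 100),
}

def percent(with_symptom, total):
    """Percentage of children showing the symptom."""
    return 100.0 * with_symptom / total

# Percentage within each age group.
by_group = {name: percent(w, n) for name, (w, n) in groups.items()}

# Percentage when all ages are pooled into one group.
pooled_with = sum(w for w, _ in groups.values())
pooled_total = sum(n for _, n in groups.values())
pooled = percent(pooled_with, pooled_total)

for name, pct in by_group.items():
    print(f"{name}: {pct:.1f}%")
print(f"all ages pooled: {pooled:.1f}%")
```

With these made-up numbers, the pooled percentage falls between the group percentages and describes none of the age groups well, which is the kind of masking the authors warn about.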


Finding Patterns across Studies 


Sometimes patterns that are not apparent in single studies become apparent in a 
meta-analysis. This could be due to small sample sizes in the original studies or to 
the fact that each of the original studies considered only one part of a question. 

In Case Study 25.1, in which smokers were found to have lower sperm density, 
the authors were able to investigate whether the decrease was more pronounced for 
heavier smokers. In the 12 studies for which relevant information was provided, 8 
showed evidence that the magnitude of decrease in sperm density was related to in- 
creasing numbers of cigarettes smoked. None of these studies alone provided suffi- 
cient evidence to detect this pattern. 
EXAMPLE 3 Grouping Studies According to Orientation


Bruvold (1993) compared adolescent smoking prevention programs. He characterized 
programs as having one of four orientations. The “rational” orientation focused on lec- 
tures and displays of substances; the “developmental” orientation used lectures with 
discussion, problem solving, and some role-playing; the “social norms” orientation used 
participation in community and recreational activities; and the “social reinforcement” 
orientation used discussion, role-playing, and public commitment not to smoke. 

Individual studies are likely to focus on one orientation only. But a meta-analysis can 
group studies according to which orientation they used and do a comparison. Bruvold 
(1993) found that 


the rational orientation had very little impact on behavior, that the social norms 
and developmental orientations had approximately the same intermediate impact 
on behavior, and that the social reinforcement orientation had the greatest impact 
on behavior. (pp. 877-878) 


Of course, caution must be applied when comparing studies on one feature only. As 
with most research, confounding factors are a possibility. For example, if the programs 
using the social reinforcement orientation had also been the more recent studies, that 
would have confounded the results because there has been much more societal
pressure against smoking in recent years.


25.4 Criticisms of Meta-Analysis 


Meta-analysis is a relatively new endeavor and not all of the surrounding issues have 
been resolved. Some valid criticisms are still being debated by methodologists. You 
have already encountered one of them—namely, that Simpson’s Paradox or its 
equivalent could apply. 


Some possible problems with meta-analysis are 

1. Simpson’s Paradox
2. Confounding variables
3. Subtle differences in treatments of the same name
4. The file drawer problem
5. Biased or flawed original studies
6. Statistical significance versus practical importance
7. False findings of “no difference”


The Possibility of Confounding Variables 


Because meta-analysis is essentially observational in nature, the various treatments 
cannot be randomly assigned across studies. Therefore, there may be differences 
across studies that are confounded with the treatments used. For example, it is
common for the studies considered in meta-analysis to have been carried out in a variety
of countries. It could be that cultural differences are confounded with treatment dif- 
ferences. Thus, a meta-analysis of studies relating dietary fat to breast cancer may 
find a strong link, but it may be because the same countries that have high-fat diets 
are also unhealthful in other ways. 


Subtle Differences in Treatments of the Same Name 


Another problem encountered in meta-analysis is that different studies may involve 
subtle differences in treatments but call the different treatments by the same name. 
For instance, chemotherapy may be applied weekly in one study but biweekly in an- 
other. The higher frequency may be too toxic, whereas the lower frequency is bene- 
ficial. When researchers are combining hundreds of studies, not uncommon in 
meta-analysis, they may not take the time to discover these subtle differences, which 
may result in major differences in the outcomes under study. 


The File Drawer Problem 


Related to the question of which studies to include in a meta-analysis is the possi- 
bility that numerous studies may not be discovered by the meta-analyst. These are 
likely to be studies that did not achieve statistical significance and thus were never 
published. This is called the file drawer problem because the assumption is that 
these studies are filed away somewhere but not publicly accessible. If the statistically 
significant studies of a relationship are the ones that are more likely to be available, 
then the meta-analysis may overestimate the size of the relationship. This is akin to 
selecting only those with strong opinions in sample surveys. 

One way researchers deal with the file drawer problem is to contact all persons 
known to be working on a particular topic and ask them if they have done studies 
that were never published. In smaller fields of research, this is probably an effective 
method of retrieving studies. Another way to deal with the problem, suggested by 
psychologist Robert Rosenthal (1991, p. 104), is to estimate how many such studies 
would be needed to reduce the relationship to nonsignificance. If the answer is an 
absurdly large number, then the relationship is probably a real one. 
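
One common form of this calculation is often called the "fail-safe N." The following Python code is a minimal sketch, not a definitive implementation. It assumes that each study is summarized by a z-score, that the studies are combined by dividing the sum of the z-scores by the square root of the number of studies, and that the unseen studies average a z-score of 0. The function name and the z-scores are hypothetical, chosen only for illustration.

```python
import math

def fail_safe_n(z_scores, z_crit=1.645):
    """Smallest number of unseen studies averaging z = 0 that would pull
    the combined z-score below z_crit (one-sided 5% level by default)."""
    k = len(z_scores)
    total_z = sum(z_scores)
    if total_z <= 0:
        # The combined result is not in the significant direction at all.
        return 0
    # Combined z with x extra null studies is total_z / sqrt(k + x);
    # setting it equal to z_crit and solving gives (total_z / z_crit)^2 - k.
    needed = (total_z / z_crit) ** 2 - k
    return max(0, math.ceil(needed))

# Hypothetical z-scores from 10 published studies (illustration only).
published = [2.1, 1.8, 2.5, 0.9, 1.7, 2.2, 1.4, 2.8, 1.1, 2.0]
n_hidden = fail_safe_n(published)
print(n_hidden)
```

If the number printed is far larger than the number of studies that could plausibly be sitting in file drawers, the relationship is unlikely to be an artifact of selective publication.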


Biased or Flawed Original Studies 


If the original studies were flawed or biased, so is the meta-analysis. In Tainted 
Truth, Cynthia Crossen (1994, pp. 43-53) discusses the controversy surrounding 
whether oat bran lowers cholesterol. She notes that the final word on the controversy 
came in the form of a meta-analysis published in June 1992 (Ripsin et al., 1992) that 
was funded by Quaker Oats. The study concluded that “this analysis supports the 
hypothesis that incorporating oat products into the diet causes a modest reduction in 
blood cholesterol level” (Ripsin et al., 1992, p. 3317). 

Crossen raises questions about the studies that were included in the meta-analy- 
sis. First, she comments that “of the entire published literature of scientific research 
on oat bran in the U.S., the lion’s share has been at least partly financed by Quaker 
Oats” (Crossen, 1994, p. 52). In response to Quaker Oats’s defense that no strings
were attached to the money and that scientists are not going to risk their reputations 
for a small research grant, Crossen replies: 


In most cases, that is true. Nothing is more damaging to scientists’ reputations 
—and their economic survival—than suspicions of fraud, corruption or dishon- 
esty. But scientists are only human, and in the course of any research project 
they make choices. Who were the subjects of the study? Young or old, male or 
female, high cholesterol or low? How were they chosen? How long did the 
study go on? . . . Since cholesterol levels vary season to season, when did the 
study begin and when did it end? (p. 49) 


Statistical Significance versus Practical Importance 


We have already learned that a statistically significant result is not necessarily of any 
practical importance. Meta-analysis is particularly likely to find statistically signifi- 
cant results because typically the combined studies provide very large sample sizes. 
Thus, it is important to learn the magnitude of an effect found in a meta-analysis and 
not just that it is statistically significant. 

For example, the oat bran meta-analysis did indeed show a statistically signifi- 
cant reduction in cholesterol, a conclusion that led to headlines across the country. 
However, the magnitude of the reduction was quite small. A 95% confidence inter- 
val extended from 3.3 mg/dL to 8.4 mg/dL. The average cholesterol level for Amer- 
icans is about 210 mg/dL. 
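
To put the interval in perspective, the short Python sketch below expresses its endpoints as a percentage of the average cholesterol level. It uses only the figures given above. Comparing the reduction with the national average, rather than with the starting levels of the people in the studies, is a simplifying assumption.

```python
ci_low, ci_high = 3.3, 8.4   # mg/dL, ends of the 95% confidence interval
average_level = 210.0        # mg/dL, approximate average level quoted above

# Express each end of the interval as a percentage of the average level.
pct_low = 100 * ci_low / average_level
pct_high = 100 * ci_high / average_level

print(f"Reduction of about {pct_low:.1f}% to {pct_high:.1f}% of the average level")
```

Seen this way, the statistically significant reduction amounts to only a few percent of a typical cholesterol level.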


False Findings of “No Difference” 


It is also possible that a meta-analysis will erroneously reach the conclusion that 
there is no difference or no relationship—when in fact there was simply not enough 
data to find one that was statistically significant. Because a meta-analysis is sup- 
posed to be so comprehensive, there is even greater danger than with single studies 
that the conclusion will be taken to mean that there really is no difference. We will 
see an example of this problem in Case Study 25.2. 

Most meta-analyses include sufficient data so that important differences will be 
detected, and that can lead to a false sense of security. The main danger is when not 
many studies have been done on a particular subset or question. It may be true that 
there are hundreds of thousands of participants across all studies, but if only a small 
fraction of them are in a certain age group, received a certain treatment, and so on, 
then a statistically significant difference may still not be found for that subset. When 
you read about a meta-analysis, be careful not to confuse the overall sample size 
with the one used for any particular subgroup. 
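
The following Python sketch illustrates the point with invented numbers. It assumes two equal-sized groups, the usual normal approximation for the difference in two proportions, and hypothetical outcome rates of 1.0% for the control group and 0.8% for the treated group. The same rates are used for the full set of participants and for a much smaller subgroup.

```python
import math

def diff_ci(p_treat, p_control, n_per_group, z=1.96):
    """Approximate 95% confidence interval for the difference in two
    proportions, assuming equal group sizes (normal approximation)."""
    se = math.sqrt(p_treat * (1 - p_treat) / n_per_group
                   + p_control * (1 - p_control) / n_per_group)
    diff = p_treat - p_control
    return diff - z * se, diff + z * se

# Hypothetical outcome rates, invented for illustration only.
p_control, p_treat = 0.010, 0.008

overall = diff_ci(p_treat, p_control, n_per_group=100_000)  # all participants
subgroup = diff_ci(p_treat, p_control, n_per_group=5_000)   # one small subset

print("overall: ", overall)
print("subgroup:", subgroup)
```

With these assumed values, the interval for all participants lies entirely below zero, while the interval for the subgroup includes zero even though the underlying difference is identical.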


Controversy over Mammograms 


In the fall of 1993, a debate raged between the National Cancer Institute and the 
American Cancer Society. Both organizations had been recommending that at age 40 
women should begin to have mammograms, or breast X rays, as a breast cancer 
screening device. 
In February 1993, the National Cancer Institute convened an international
conference to bring together experts from around the world to help conduct a
meta-analysis of studies on the effectiveness of mammography as a screening device.
Their conclusion about women aged 40-49 years was: “For this age group it is clear 
that in the first 5-7 years after study entry, there is no reduction in mortality from 
breast cancer that can be attributed to screening” (Fletcher et al., 20 October 1993, 
p. 1653). 

The results of the meta-analysis amounted to a withdrawal of the National Can- 
cer Institute’s support for mammograms for women under 50. The American Cancer 
Society disputed the study and announced that it would not change its
recommendation. As noted in a front-page story in the San Jose Mercury News on November 24,
1993: 


A spokeswoman for the American Cancer Society’s national office said Tues- 
day that the . . . study would not change the group’s recommendation because 
it was not big enough to draw definite conclusions. The study would have to 
screen 1 million women to get a certain answer because breast cancer is so
uncommon in young women.


In fact, the entire meta-analysis considered eight randomized experiments conducted 
over 30 years, including nearly 500,000 women. That sounds impressive. But not all 
of the studies included women under 50. As noted by Sickles and Kopans (20 Octo- 
ber 1993): 


Even pooling the data from all eight randomized controlled trials produces in- 
sufficient statistical power to indicate presence or absence of benefit from 
screening. In the eight trials, there were only 167,000 women (30% of the par- 
ticipants) aged 40-49, a number too small to provide a statistically significant 
result. (p. 1622) 


There were other complaints about the studies as well. Even the participants in the 
conference recognized that there were methodological flaws in some of the studies. 
About one of the studies on women aged 40-49, they commented: 


It is worrisome that more patients in the screening group had advanced tu- 
mors, and this fact may be responsible for the results reported to date. . . . The 
technical quality of mammography early in this trial is of concern, but it is not 
clear to what extent mammography quality affected the study outcome. 
(Fletcher et al., 20 October 1993, p. 1653) 


There is thus sufficient concern about the studies themselves to make the results in- 
conclusive. But they were obviously statistically inconclusive as well. Here is the 
full set of results for women aged 40-49: 


A meta-analysis of six trials found a relative risk of 1.08 (95% confidence in- 
terval = 0.85 to 1.39) after 7 years’ follow-up. After 10-12 years of follow-up, 
none of four trials have found a statistically significant benefit in mortality; a 
combined analysis of Swedish studies showed a statistically insignificant 13% 
decrease in mortality at 12 years. One trial (Health Insurance Plan) has data 
beyond 12 years of follow-up, and results show a 25% decrease in mortality
at 10-18 years. Statistical significance of this result is disputed, however. 
(Fletcher et al., 20 October 1993, p. 1644) 
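
The first interval in the quotation can be restated as a range of percentage changes in risk. The Python sketch below does this for the interval 0.85 to 1.39. The helper function is hypothetical, and it assumes the conventional reading in which the relative risk compares screened women with unscreened women, which the quotation does not spell out.

```python
def describe_relative_risk(low, high):
    """Convert a relative-risk confidence interval into percentage changes
    and report whether it includes 1.0 (no difference)."""
    pct_low = round((low - 1) * 100)    # negative means a decrease in risk
    pct_high = round((high - 1) * 100)  # positive means an increase in risk
    includes_one = low <= 1.0 <= high
    return pct_low, pct_high, includes_one

result = describe_relative_risk(0.85, 1.39)
print(result)
```

Under that reading, the interval is consistent with anything from a 15% decrease to a 39% increase in mortality, and because it includes 1.0, it cannot distinguish a benefit from no difference or from harm.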


A debate on the MacNeil-Lehrer News Hour between American Cancer Society 
spokesman Dr. Harmon Eyre and one of the authors of the meta-analysis, Dr. Bar- 
bara Rimer, illustrated the difficulty of interpreting results like these for the public. 
Dr. Eyre argued that “30% of those who will die could be saved if you look at the 
data as we do; the 95% confidence limits in the meta-analysis could include a 30% 
decrease in those who will die.” 

Dr. Rimer, on the other hand, kept reiterating that mammography for
women under age 50 simply “hasn’t been shown to save lives.” In response, Dr. Eyre 
accused the National Cancer Institute of having a political agenda. If women under 
50 do not have regular mammograms, it could save a national health-care program 
large amounts of money. 

By now, you should recognize the real nature of the debate here. The results are 
inconclusive because of small sample sizes and possible methodological flaws. The 
question is not a statistical one—it is a public policy one. Given that we do not know 
for certain that mammograms save lives for women under 50, should health insurers 
be required to pay for them? 


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. Explain why the vote-counting method is not a good way to summarize the re- 
sults in a research area. 

2. Explain why the person deciding which studies to include in a meta-analysis 
should be told how each study was conducted but not the results of the study. 

*3. According to a report in New Scientist (Vine, 21 January 1995), the UK
Cochrane Centre in Oxford is launching the Cochrane Database of Systematic
Reviews, which will “focus on a number of diseases, drawing conclusions about 
which treatments work and which do not from all the available randomized con- 
trolled trials” (p. 14). The reviews will be made available electronically to prac- 
ticing physicians and will contain meta-analyses of various areas of medical 
research. 


*a. Why do you think they only planned to include “randomized controlled tri- 
als” (that is, randomized experiments) and not observational studies? 

*b. Do you think they should include studies that did not find statistically sig- 
nificant results when they do their meta-analyses? Why or why not? 

4. Refer to Exercise 3. Pick one of the benefits of meta-analysis listed in Section 
25.3 and explain how that benefit applies to the reviews in the Cochrane 
Database. 

5. Refer to Exercise 3. Pick three of the possible problems with meta-analysis 
listed in Section 25.4 and discuss how they might apply to the reviews in the 
Cochrane Database. 
*6. An article in the Sacramento Bee (15 April 1998, pp. A1, A12), titled “Drug
reactions a top killer, research finds,” reported a study estimating that between
76,000 and 137,000 deaths a year in the United States occur due to adverse re- 
actions to medications. Here are two quotes from the article: 


The scientists reached their conclusion not in one large study but by com- 
bining the results of 39 smaller studies. This technique, called meta- 
analysis, can enable researchers to draw statistically significant conclu- 
sions from studies that individually are too small. (p. A12) 

[A critic of the research] said the estimates in Pomeranz’s study might 
be high, because the figures came from large teaching hospitals with the 
sickest patients, where more drug use and higher rates of drug reactions 
would be expected than in smaller hospitals. (p. A12) 


a. Pick one of the benefits in Section 25.3 and one of the criticisms in Section 
25.4 and apply them to this study. 


*b. The study estimated that between 76,000 and 137,000 deaths a year occur 
due to adverse reactions to medications, with a mean estimate of 100,000. If 
this is true, then adverse drug reaction is the fourth-leading cause of death in 
the United States. Based on these results, which one of the criticisms given 
in Section 25.4 is definitely not a problem? Explain. 


7. In a meta-analysis, researchers can choose either to combine results across
studies to produce a single confidence interval or to report separate confidence
intervals for each study and compare them. Give one advantage and one
disadvantage of combining the results into one confidence interval.


*8. Would the file drawer problem be more likely to present a substantial difficulty
in a research field with 50 researchers or in one with 1000 researchers? Explain.


*9. Give two reasons why researchers might not want to include all possible studies
in a meta-analysis of a certain topic. 


10. Suppose a meta-analysis on the risks of drinking coffee included 100,000 par-
ticipants of all ages across 80 studies. One of the conclusions was that a confi- 
dence interval for the relative risk of heart attack for women over 70 who drank 
coffee, compared with those who didn’t, was 0.90 to 1.30. The researchers con- 
cluded that coffee does not present a problem for women over 70. Explain why 
their conclusion is not justified. 


11. Eisenberg and colleagues (15 June 1993) used meta-analysis to examine the ef-
fect of “cognitive therapies” such as biofeedback, meditation, and relaxation 
methods on blood pressure. Their abstract states that “cognitive interventions 
for essential hypertension are superior to no therapy but not superior to credible 
sham techniques” (p. 964). Here are some of the details from the text of the 
article: 


Mean blood pressure reductions were smallest when comparing persons 
experimentally treated with those randomly assigned to a credible placebo 
or sham intervention and for whom baseline blood pressure assessments 
were made during a period of more than 1 day. Under these conditions,
individuals treated with cognitive behavioral therapy experienced a mean
reduction in systolic blood pressure of 2.8 mm Hg (CI, -0.8 to 6.4 mm Hg)
. . . relative to controls. (p. 967)


The comparison included 12 studies involving 368 subjects. Do you agree with 
the conclusion in the abstract that cognitive interventions are not superior to 
credible sham techniques? What would be a better way to word the conclusion? 


12. Science News (25 January 1995) reported a study on the relationship between
levels of radon gas in homes and lung cancer. It was a case-control study, with 
538 women with lung cancer and 1183 without lung cancer. The article noted 
that the average level of radon concentration in the homes of the two groups over 
a 1-year period was exactly the same. The article also contained the following
quote, reporting on an editorial by Jonathan Samet that accompanied the origi- 
nal study: 


For the statistical significance needed to assess accurately whether low 
residential exposures constitute no risk . . . the study would have required 
many more participants than the number used in this or any other residen- 
tial-radon study to date. . . . But those numbers may soon become avail- 
able, he adds, as researchers complete a spate of new studies whose data 
can be pooled for reanalysis. (p. 26) 


Write a short report interpreting the quote for someone with no training in 
statistics. 


13. An article in Science (Mann, 11 November 1994) describes two approaches
used to try to determine how well programs to improve public schools have 
worked. The first approach was taken by an economist named Eric Hanushek. 
Here is part of the description: 


Hanushek reviewed 38 studies and found the “startlingly consistent” result 
that “there is no strong or systematic relationship between school expendi- 
tures and student performance.” . . . Hanushek’s review used a technique 
called “vote-counting.” (p. 961) 


The other approach was a meta-analysis and the results are reported as follows: 


[The researchers] found systematic positive effects. . . . Indeed, decreased 
class size, increased teacher experience, increased teacher salaries, and 
increased per-pupil spending were all positively related to academic per- 
formance. (p. 962) 


Explain why the two approaches yielded different answers and which one you 
think is more credible. 


*14. Suppose 10 studies were done to assess the relationship between watching
violence on television and subsequent violent behavior in children. Suppose that
none of the 10 studies detected a statistically significant relationship. Is it pos- 
sible for a vote-counting procedure to detect a relationship? Is it possible for a 
meta-analysis to detect a relationship? Explain. 
15. Refer to News Story 9 in the Appendix and on the CD, “Against depression, a
sugar pill is hard to beat.” There is a meta-analysis of “96 antidepressant trials” 
described in the story. Use that description to answer these questions. 


a. How did the analyst answer the question “Which studies should be
included?”

b. How did the analyst answer the question “Should results be compared or
combined?”

c. One statement in the story is “in 52 percent of them [the studies], the effect
of the antidepressant could not be distinguished from that of the placebo.”
Do you think this statement is an example of “vote-counting”? What
additional information would you need to determine whether or not it is?

d. Refer to part c. Assuming the statement is an example of vote-counting,
should it be used to conclude that the effects of antidepressants and placebos
are the same? Explain.


Mini-Projects 


1. Find a journal article that presents a meta-analysis. (They are commonly found 
in medical and psychology journals.) 


a. Give an overview of the article and its conclusions.

b. Explain what the researchers used as the criteria for deciding which studies
to include.

c. Explain which of the benefits of meta-analysis were incorporated in the
article.

d. Discuss each of the potential criticisms of meta-analysis and how they were
handled in the article.

e. Summarize your conclusions about the topic based on the article and your
answers to parts b-d.


2. Find a newspaper report of a meta-analysis. Discuss whether important infor- 
mation is missing from the report. Critically evaluate the conclusions of the 
meta-analysis, taking into consideration the criticisms listed in Section 25.4. 
Ethics in Statistical Studies


As with any human endeavor, in statistical studies there are ethical considerations 
that must be taken into account. These considerations fall into multiple categories, 
and we discuss the following issues here: 


1. Ethical treatment of human and animal participants
2. Assurance of data quality
3. Appropriate statistical analyses
4. Fair reporting of results


Most professional societies have a code of ethics that their members are asked to 
follow. Guidelines about the conduct of research studies are often included. For in- 
stance, the American Psychological Association first published a code of ethics in 
1953 and has updated it at least every 10 years since then. In 1999 the American Sta- 
tistical Association, one of the largest organizations of professional statisticians, 
published Ethical Guidelines for Statistical Practice listing 67 recommendations, di- 
vided into categories such as “Professionalism” and “Responsibilities to Research 
Subjects.” 


26.1 Ethical Treatment of Human and Animal Participants 
There has been increasing attention paid to the ethics of experimental work, partly
based on some examples of deceptive experiments that led to unintended harm to
participants. Here is a classic example of an experiment conducted in the early 1960s
that would probably be considered unethical today. 


EXAMPLE 1 Stanley Milgram’s “Obedience and Individual Responsibility” Experiment

Social psychologist Stanley Milgram was interested in the extent to which ordinary citi- 
zens would obey an authority figure, even if it meant injuring another human being. 
Through newspaper ads offering people money to participate in an experiment on learn- 
ing, he recruited people in the area surrounding Yale University, where he was a mem- 
ber of the faculty. When participants arrived they were greeted by an authoritative 
researcher in a white coat, and introduced to another person who they were told was a 
participant like them but who was actually an actor. Lots were drawn to see who would 
be the “teacher” and who would be the “student,” but in fact it had been predeter- 
mined that the actor would be the “student” and the local citizen would be the 
“teacher.” 

The student/actor was placed in a chair with restricted movement and hooked up to 
what was alleged to be an electrode that administered an electric shock. The teacher 
conducted a memory task with the student and was instructed to administer a shock 
when the student missed an answer. The shocking mechanism was shown to start at 15 
volts and increase in intensity with each wrong answer, up to 450 volts. When the al- 
leged voltage reached 375, it was labeled as “Danger/Severe” and when it reached 435 
it was labeled “XXX.” The experimenter sat at a nearby table, encouraging the teacher 
to continue to administer the shocks. The student/actor would respond with visible and 
increasing distress. The teacher was told that the experimenter would take responsibil- 
ity for any harm that came to the student. 

The disturbing outcome of the experiment was that 65% of the participants con- 
tinued to administer the alleged shocks up to the full intensity, even though many of 
them were quite distressed and nervous about doing so. Even at “very strong” intensity, 
80% of the participants were still administering the electric shocks. (Sources: 
http://www.vanderbilt.edu/AnS/Anthro/Anth101/stanley_milgram_experiment.htm and
Milgram (1983).)


This experiment would be considered unethical today because of the stress it 
caused the participants. Based on this and similar experiments, the American Psy- 
chological Association continues to update its Ethical Principles of Psychologists 
and Code of Conduct. The latest version as of January 2004 can be found at 
http://www.apa.org/ethics/code2002.html. One of the sections of the code is called 
“Deception in Research” and includes three instructions
(http://www.apa.org/ethics/code2002.html#8_07); Milgram’s experiment would
most likely fail the criterion in
part (b): 


(a) Psychologists do not conduct a study involving deception unless they have 
determined that the use of deceptive techniques is justified by the study’s sig- 
nificant prospective scientific, educational, or applied value and that effective 
nondeceptive alternative procedures are not feasible. 

(b) Psychologists do not deceive prospective participants about research that 
is reasonably expected to cause physical pain or severe emotional distress. 

(c) Psychologists explain any deception that is an integral feature of the de- 
sign and conduct of an experiment to participants as early as is feasible, 
preferably at the conclusion of their participation, but no later than at the 
conclusion of the data collection, and permit participants to withdraw their
data. 


Informed Consent 


Virtually all experiments with human participants require that the researchers obtain 
the informed consent of the participants. In other words, participants are to be told 
what the research is about and given an opportunity to make an informed choice 
about whether to participate. If you were a potential participant in a research study, 
what would you want to know in advance to make an informed choice about partic- 
ipation? Because of such issues as the need for a control group and the use of 
double-blinding, it is often the case that participants cannot be told everything in ad- 
vance. For instance, it would be antithetical to good experimental procedure to tell 
participants in advance if they were taking a drug or a placebo, or to tell them if they 
were in the treatment group or control group. Instead, the use of multiple groups is 
explained and participants are told that they will be randomly assigned to a group 
but will not know what it is until the conclusion of the experiment. 

The information provided in this process is slightly different in research such as 
psychology experiments than it is in medical research. In both cases, participants are 
supposed to be told the nature and purpose of the research and any risks or benefits. 
In medical research the participants generally are suffering from a disease or illness, 
and an additional requirement is that they be informed about alternative treatments. 
Of course, it is unethical to withhold a treatment known to work. 

In section 8.02 of its code of ethics, the American Psychological Association 
provides these guidelines for informed consent in experiments in psychology 
(http://www.apa.org/ethics/code2002.html#8_02): 


(a) When obtaining informed consent as required in Standard 3.10, Informed 
Consent, psychologists inform participants about (1) the purpose of the re- 
search, expected duration, and procedures; (2) their right to decline to partici- 
pate and to withdraw from the research once participation has begun; (3) the 
foreseeable consequences of declining or withdrawing; (4) reasonably foresee- 
able factors that may be expected to influence their willingness to participate 
such as potential risks, discomfort, or adverse effects; (5) any prospective re- 
search benefits; (6) limits of confidentiality; (7) incentives for participation; 
and (8) whom to contact for questions about the research and research partici- 
pants’ rights. They provide opportunity for the prospective participants to ask 
questions and receive answers. 


(b) Psychologists conducting intervention research involving the use of experi- 
mental treatments clarify to participants at the outset of the research (1) the 
experimental nature of the treatment; (2) the services that will or will not be 
available to the control group(s) if appropriate; (3) the means by which 
assignment to treatment and control groups will be made; (4) available treat- 
ment alternatives if an individual does not wish to participate in the research 
or wishes to withdraw once a study has begun; and (5) compensation for or 
monetary costs of participating including, if appropriate, whether
reimbursement from the participant or a third-party payor will be sought.


Informed Consent in Medical Research 


The United States Department of Health and Human Services has a detailed policy 
on informed consent practices in medical research, which can be accessed on the Internet
at http://ohrp.osophs.dhhs.gov/humansubjects/guidance/45cfr46.htm#46.116.
A more “user friendly” Web site is provided by the Department of Health and
Human Services Office for Human Research Protections, Office for Protection from 
Research Risks, which has provided the following tips and checklist. The acronym 
IRB stands for Institutional Review Board, which is a board that all research institu- 
tions are required to maintain for oversight of research involving human and animal 
participants. 


Tips on Informed Consent 


The process of obtaining informed consent must comply with the require- 
ments of 45 CFR 46.116. The documentation of informed consent must com- 
ply with 45 CFR 46.117. The following comments may help in the 
development of an approach and proposed language by investigators for ob- 
taining consent and its approval by IRBs: 


■ Informed consent is a process, not just a form. Information must be
presented to enable persons to voluntarily decide whether or not to partici- 
pate as a research subject. It is a fundamental mechanism to ensure respect 
for persons through provision of thoughtful consent for a voluntary act. 
The procedures used in obtaining informed consent should be designed to 
educate the subject population in terms that they can understand. There- 
fore, informed consent language and its documentation (especially expla- 
nation of the study’s purpose, duration, experimental procedures, 
alternatives, risks, and benefits) must be written in “lay language” (i.e., 
understandable to the people being asked to participate). The written pre- 
sentation of information is used to document the basis for consent and for 
the subjects’ future reference. The consent document should be revised 
when deficiencies are noted or when additional information will improve 
the consent process. 


■ Use of the first person (e.g., “I understand that ...”) can be interpreted
as suggestive, may be relied upon as a substitute for sufficient
factual information, and can constitute coercive influence over a sub- 
ject. Use of scientific jargon and legalese is not appropriate. Think of the 
document primarily as a teaching tool not as a legal instrument. 


■ Describe the overall experience that will be encountered. Explain the
research activity, how it is experimental (e.g., a new drug, extra tests, separate
research records, or nonstandard means of management, such as flipping
a coin for random assignment or other design issues). Inform the
human subjects of the reasonably foreseeable harms, discomforts, incon- 
venience, and risks that are associated with the research activity. If addi- 
tional risks are identified during the course of the research, the consent 
process and documentation will require revisions to inform subjects as 
they are recontacted or newly contacted. 


■ Describe the benefits that subjects may reasonably expect to encounter.
There may be none other than a sense of helping the public at
large. If payment is given to defray the incurred expense for participation, 
it must not be coercive in amount or method of distribution. 


■ Describe any alternatives to participating in the research project. For
example, in drug studies the medication(s) may be available through their 
family doctor or clinic without the need to volunteer for the research 
activity. 


■ The regulations insist that the subjects be told the extent to which
their personally identifiable private information will be held in confi- 
dence. For example, some studies require disclosure of information to 
other parties. Some studies inherently are in need of a Certificate of Con- 
fidentiality which protects the investigator from involuntary release (e.g., 
subpoena) of the names or other identifying characteristics of research 
subjects. The IRB will determine the level of adequate requirements for 
confidentiality in light of its mandate to ensure minimization of risk and 
determination that the residual risks warrant involvement of subjects. 


■ If research-related injury (i.e., physical, psychological, social, financial,
or otherwise) is possible in research that is more than minimal
risk (see 45 CFR 46.102[g]), an explanation must be given of whatever 
voluntary compensation and treatment will be provided. Note that the 
regulations do not limit injury to “physical injury.” This is a common mis- 
interpretation. 


■ The regulations prohibit waiving or appearing to waive any legal
rights of subjects. Therefore, for example, consent language must be 
carefully selected that deals with what the institution is voluntarily willing 
to do under circumstances such as providing for compensation beyond the 
provision of immediate or therapeutic intervention in response to a 
research-related injury. In short, subjects should not be given the impres- 
sion that they have agreed to and are without recourse to seek satisfaction 
beyond the institution’s voluntarily chosen limits. 


■ The regulations provide for the identification of contact persons who
would be knowledgeable to answer questions of subjects about the re- 
search, rights as a research subject, and research-related injuries. These 
three areas must be explicitly stated and addressed in the consent
process and documentation. Furthermore, a single person is not likely to 
be appropriate to answer questions in all areas. This is because of potential 
conflicts of interest or the appearance of such. Questions about the re- 
search are frequently best answered by the investigator(s). However, ques- 
tions about the rights of research subjects or research-related injuries 
(where applicable) may best be referred to those not on the research team. 
These questions could be addressed to the IRB, an ombudsman, an ethics 
committee, or other informed administrative body. Therefore, each consent 
document can be expected to have at least two names with local telephone 
numbers for contacts to answer questions in these specified areas. 


■ The statement regarding voluntary participation and the right to
withdraw at any time can be taken almost verbatim from the regula- 
tions (45 CFR 46.116[a][8]). It is important not to overlook the need to 
point out that no penalty or loss of benefits will occur as a result of both 
not participating or withdrawing at any time. It is equally important to 
alert potential subjects to any foreseeable consequences to them should 
they unilaterally withdraw while dependent on some intervention to main- 
tain normal function. 


■ Don’t forget to ensure provision for appropriate additional requirements
which concern consent. Some of these requirements can be found
in sections 46.116(b), 46.205(a)(2), 46.207(b), 46.208(b), 46.209(d), 
46.305(a)(5-6), 46.408(c), and 46.409(b). The IRB may impose additional 
requirements that are not specifically listed in the regulations to ensure 
that adequate information is presented in accordance with institutional 
policy and local law. 


Source: http://ohrp.osophs.dhhs.gov/humansubjects/guidance/ictips.htm 


Informed Consent Checklist 


Basic and Additional Elements 


A statement that the study involves research 


An explanation of the purposes of the research 


The expected duration of the subject’s participation 


A description of the procedures to be followed 


Identification of any procedures which are experimental 


A description of any reasonably foreseeable risks or discomforts to the 
subject 


A description of any benefits to the subject or to others which may rea- 
sonably be expected from the research 


A disclosure of appropriate alternative procedures or courses of treatment,
if any, that might be advantageous to the subject


A statement describing the extent, if any, to which confidentiality of 
records identifying the subject will be maintained 


For research involving more than minimal risk, an explanation as to 
whether any compensation, and an explanation as to whether any medical 
treatments are available, if injury occurs and, if so, what they consist of, 
or where further information may be obtained 


Research Qs 
Rights Qs 
Injury Qs 


An explanation of whom to contact for answers to pertinent questions 
about the research and research subjects’ rights, and whom to contact in 
the event of a research-related injury to the subject 


A statement that participation is voluntary, refusal to participate will in- 
volve no penalty or loss of benefits to which the subject is otherwise enti- 
tled, and the subject may discontinue participation at any time without 
penalty or loss of benefits, to which the subject is otherwise entitled 


Additional Elements, as Appropriate 


A statement that the particular treatment or procedure may involve risks 
to the subject (or to the embryo or fetus, if the subject is or may become 
pregnant), which are currently unforeseeable 


Anticipated circumstances under which the subject’s participation may be 
terminated by the investigator without regard to the subject’s consent 


Any additional costs to the subject that may result from participation in 
the research 


The consequences of a subject’s decision to withdraw from the research 
and procedures for orderly termination of participation by the subject 


A statement that significant new findings developed during the course of 
the research, which may relate to the subject’s willingness to continue 
participation, will be provided to the subject 


The approximate number of subjects involved in the study 


Source: http://ohrp.osophs.dhhs.gov/humansubjects/assurance/consentckls.htm 


As you can see, the process of obtaining informed consent can be daunting and 
some potential participants may opt out because of an excess of information. There 
are additional issues that arise when participants are in certain categories. For ex- 
ample, young children cannot be expected to fully understand the informed consent 
process. The regulations state that children should be involved in the decision- 
making process to the extent possible and that a parent must also be involved. There 
are special rules applying to research on prisoners and to research on “Pregnant
Women, Human Fetuses and Neonates.” These can be found on the Internet at 
http://ohrp.osophs.dhhs.gov/humansubjects/guidance/45cfr46.htm. 


Research on Animals 


Perhaps nothing is as controversial in research as the use of animals. There are some 
who believe that any experimentation on animals is unethical, whereas others argue 
that humans have benefited immensely from such research and that justifies its use. 
There is no doubt that in the past some research with animals, as with humans, has 
been clearly unethical. But again the professions of medicine and behavioral sci- 
ences, which are the two arenas in which animal research is most prevalent, have de- 
veloped ethical guidelines that their members are supposed to follow. 

Nonetheless, animals do not have the rights of humans as participants in experi- 
ments. Here are two of the elements of the American Psychological Association’s 
Code of Ethics section on “Humane Care and Use of Animals in Research.” Clearly, 
these statements would not be made regarding human subjects and even these are 
stated ideals that many researchers may not follow. (Source:
http://www.apa.org/ethics/code2002.html#8_09)


(e) Psychologists use a procedure subjecting animals to pain, stress, or priva- 
tion only when an alternative procedure is unavailable and the goal is justified 
by its prospective scientific, educational, or applied value. 


(g) When it is appropriate that an animal’s life be terminated, psychologists 
proceed rapidly, with an effort to minimize pain and in accordance with ac- 
cepted procedures. 


The United States National Institutes of Health has also established the Public 
Health Service Policy on Humane Care and Use of Laboratory Animals, which can 
be found at http://www.grants.nih.gov/grants/olaw/references/phspol.htm. However, 
the guidelines for research on animals are not nearly as detailed as they are for hu- 
man subjects. A research study of the approval of animal-use protocols found strong 
inconsistencies when protocols approved or disapproved by one institution were submitted
for approval to another institution’s review board. The decisions of the two
boards to approve or disapprove the protocols agreed at no better than chance levels, 
suggesting that the guidelines for approval of animal research are not spelled out 
in sufficient detail (Plous and Herzog, 2001). For more information and links to 
a number of Web sites on the topic of research with animals, visit
http://www.socialpsychology.org/methods.htm#animals.


26.2 Assurance of Data Quality 


When reading the results of a study, you should be able to have reasonable assurance 
that the researchers collected data of high quality. This is not as easy as it sounds, 
and we have explored many of the difficulties and disasters of collecting and interpreting
data for research studies in Part 1 of the book.

The quality of data becomes an ethical issue when there are personal, political,
or financial reasons motivating one or more of those involved in the research and 
steps are not taken to assure the integrity of the data. Data quality is also an ethical 
issue when researchers knowingly fail to report problems with data collection that 
may distort the interpretation of the results. As a simple example, survey data should 
always be reported with an explanation of how the sample was selected, what ques- 
tions were asked, who responded, and how they may differ from those who didn’t 
respond. 


United States Federal Statistical Agencies 


The United States government spends large amounts of money to collect and dis- 
seminate statistical information on a wide variety of topics. In recent years, there has 
been an increased focus on making sure the data are of high quality. The Federal 
Register Notices on June 4, 2002 (vol. 67, no. 107) provided a report called Federal 
Statistical Organizations’ Guidelines for Ensuring and Maximizing the Quality, Ob- 
jectivity, Utility, and Integrity of Disseminated Information. (The report can be found 
on the Web at http://www.fedstats.gov/policy/Stat-Agency-FR-June4-2002.pdf.) The
purpose of the report was to provide notice that each of over a dozen participating
federal statistical agencies was making such guidelines available for public comment.
(A list of federal statistical agencies with links to their Web sites and the data
they provide can be found at www.fedstats.gov.) 

Most of the guidelines proposed by the statistical agencies were common-sense, 
self-evident good practice. As an example, the following box includes data collec- 
tion principles from the Bureau of Transportation Statistics Web site. 


3.1 Data Collection Operations 
Principles 


■ Forms, questionnaires, automated collection screens, and file layouts are
the medium through which data are collected. They consist of sets of 
questions or annotated blanks on paper or computer that request informa- 
tion from data suppliers. They need to be designed to maximize communi- 
cation to the data supplier. 


■ Data collection includes all the processes involved in carrying out the data
collection design to acquire data. Data collection operations can have a 
high impact on the ultimate data quality, especially when they deviate 
from the design. 


■ The data collection method should be appropriate to the data complexity,
collection size, data requirements, and amount of time available. 
For example, a reporting system will often primarily rely on the required 
reporting mechanism, with follow-up for missing data. Similarly, a large 
survey requiring a high response rate will often start off with a mail out,
followed by telephone contact, and finally by a personal visit. 


■ Specific data collection environmental choices can significantly affect error
introduced at the collection stage.


For example, if the data collector is collecting as a collateral duty or is work- 
ing in an uncomfortable environment, it may adversely affect the quality of the 
data collected. Also, if the data are particularly difficult to collect, it will af- 
fect the data quality. 


■ Conversion of data on paper to electronic form (e.g., key entry, scanning)
introduces a certain amount of error which must be controlled.

■ Third-party sources of data introduce error in their collection processes.

■ Computer-assisted information collection can result in more timely and
accurate information. Initial development costs will be higher, and much 
more lead time will be required to develop, program, and test the data col- 
lection system. However, the data can be checked and corrected when 
originally entered, key-entry error is eliminated, and the lag between data 
collection and data availability is reduced. 


■ The use of sensors for data can significantly reduce error.


Source: http://www.bts.gov/statpol/guide/chapter3.html 


So where does this fit in a discussion of ethics? Although it is laudable that the 
federal statistical agencies are spelling out these principles, the ethical issues in- 
volved with assuring the quality of government data are more subtle than can be 
evoked from general principles. There are always judgments to be made, and poli- 
tics can sometimes enter into those judgments. 

For instance, a report by the National Research Council (Citro and Norwood, 
1997) examined many aspects of the functioning of the Bureau of Transportation 
Statistics (BTS) and formulated suggestions for improvement. The BTS was created 
in 1992 and thus had only been in existence for 5 years at the time of the report. One 
of the data-quality issues discussed in the report was “comparability [of statistics 
about transportation issues] across data systems and time” (p. 32). One example 
given was that the definition of a transportation fatality was not consistent across 
modes of transportation until 1994, when consistency was mandated by the Secre- 
tary of Transportation. Before that, a highway traffic fatality was counted if death re- 
sulted from the accident within 30 days. But for a railroad accident, the time limit 
was 365 days. Continuing to use different standards for different modes of trans- 
portation would make for unfair safety comparisons. In an era where various trans- 
portation modes compete for federal dollars, politics could easily enter a decision to 
report fatality statistics one way or another. 
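
To see how much the counting rule alone can matter, here is a minimal sketch in Python. The records are invented purely for illustration (they are not BTS data), the function name count_fatalities is hypothetical, and reading "within N days" as "N days or fewer" is an assumption. The sketch counts the same set of deaths under the 30-day and the 365-day rules described above.

```python
# Hypothetical illustration: days between an accident and the resulting death.
# These numbers are made up for this sketch; they are not real BTS records.
days_until_death = [0, 0, 1, 3, 12, 29, 30, 31, 45, 90, 200, 364, 400]


def count_fatalities(days, window):
    """Count deaths that occurred within `window` days of the accident.

    Assumption: "within N days" is read as d <= N.
    """
    return sum(1 for d in days if d <= window)


highway_rule = count_fatalities(days_until_death, 30)    # pre-1994 highway definition
railroad_rule = count_fatalities(days_until_death, 365)  # pre-1994 railroad definition

print("Counted under the 30-day rule: ", highway_rule)
print("Counted under the 365-day rule:", railroad_rule)
```

The same accidents produce a noticeably larger fatality count under the longer window, so a comparison across modes is only fair when both use the same definition.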

Probably the most controversial federal data collection issue surrounds the decennial
United States census, which has been conducted every 10 years since 1790.
In 1988 a lawsuit was filed, led by New York City, alleging that urban citizens are 
undercounted in the census process. The political ramifications are enormous be- 
cause redistricting of congressional seats results from shifting population counts. 
The lawsuit began a long saga of political and statistical twists and turns that was 
still unresolved at the taking of the 2000 census. For an interesting and readable ac- 
count, see Who Counts? The Politics of Census-Taking in Contemporary America, 
by Anderson and Fienberg (2001). 

A report by the National Research Council’s Committee on National Statistics 
(Martin, Straf, and Citro, 2001), called Principles and Practices for a Federal Sta- 
tistical Agency, enumerated three principles and eleven practices that federal agen- 
cies should follow. One of the recommended practices was “a strong position of 
independence.” The recommendation included a clear statement about separating 
politics from the role of statistical agencies: 


In essence, a statistical agency must be distinct from those parts of the depart- 
ment that carry out enforcement and policy-making activities. It must be im- 
partial and avoid even the appearance that its collection, analysis, and 
reporting processes might be manipulated for political purposes or that indi- 
vidually identifiable data might be turned over for administrative, regulatory, 
or enforcement purposes. 

The circumstances of different agencies may govern the form that indepen- 
dence takes. In some cases, the legislation that establishes the agency may 
specify that the agency head be professionally qualified, be appointed by the 
President and confirmed by the Senate, serve for a specific term not coincident 
with that of the administration, and have direct access to the secretary of the 
department in which the agency is located. (p. 6) 


It should be apparent to you that it is not easy to maintain complete indepen- 
dence from politics; for instance, many of the heads of these agencies are appointed 
by the president and confirmed by the Senate. Other steps taken to help maintain in- 
dependence include prescheduled release of important statistical information such as 
unemployment rates and authority to release information without approval from the 
policy-making branches of the organization. 


Experimenter Effects and Personal Bias 


We learned in Part 1 that there are numerous ways in which experimenter effects can
bias statistical studies. If a researcher has a desired outcome for a study and if con- 
ditions are not very carefully controlled, it is quite likely that the researcher will in- 
fluence the outcome. Here are some of the precautions that may be implemented to 
help prevent this from happening: 


■ Randomization done by a third party with no vested interest in the experiment, or
at least done by a well-tested computer randomization device.

■ Automated data recording without intervention from the researcher.

■ Double-blind procedures to ensure that no one who has contact with the participants
knows which treatment or condition they are receiving.

■ An honest evaluation that what is being measured is appropriate and unbiased for
the research question of interest.

■ A standard protocol for the treatment of all participants that must be strictly
followed.


EXAMPLE 2
Janet's (Hypothetical) Dissertation Research


This is a hypothetical example to illustrate some of the subtle (and not so subtle) ways ex- 
perimenter bias can alter the data collected for a study. Janet is a Ph.D. student and is un- 
der tremendous pressure to complete her research successfully. For her study, she 
hypothesized that role-playing assertiveness training for women would help them learn 
to say “no” to telephone solicitors. She recruited 50 undergraduate women as volunteers 
for the study. The plan was that each volunteer would come to her office for half an hour. 
For 25 of the volunteers (the control group) she would simply talk with them for 30 min- 
utes about a variety of topics, including their feelings about saying “no” to unwanted re- 
quests. For the other 25 volunteers Janet would spend 15 minutes on similar discussion 
and the remaining 15 minutes on a prespecified role-playing scenario in which the vol- 
unteer got to practice saying “no” in various situations. Two weeks after each volunteer's 
visit, Brad, a colleague of Janet's, would phone them anonymously, pretending to be a 
telephone solicitor selling a magazine for a good price, and would record the conversa- 
tion so that Janet could determine whether or not they were able to say “no.” 

It's the first day of the experiment and the first volunteer is in Janet's office. Janet 
has the randomization list that was prepared by someone else, which randomly assigns 
each of the 50 volunteers to either Group 1 or Group 2. This first volunteer appears to 
be particularly timid, and Janet is sure she won't be able to learn to say “no” to anyone. 
The randomization list says that volunteer 1 is to be in Group 2. But what was Group 2? 
Did she say in advance? She can’t remember. Oh well, Group 2 will be the control group. 

The next volunteer comes in and, according to the randomization list, volunteer 2 is 
to be assigned to Group 1, which now is defined to be the role-playing group. Janet fol- 
lows her predefined protocol for the half hour. But when the half hour is over, the stu- 
dent doesn't want to leave. Just then Brad comes by to say “hello,” and the three of 
them spend another half hour having an amiable conversation. 

The second phase of the experiment begins and Brad begins phoning the volun- 
teers. The conversations are recorded so that Janet can assess the results. When listen- 
ing to volunteer 2's conversation, Janet notices that almost immediately she says to Brad, 
“Your voice sounds awfully familiar, do I know you?” When he assures her that she does
not and asks her to buy the magazine, she says, “I can’t place my finger on it, but this
is a trick, right? I’m sure I know your voice. No thanks, no magazine!” Janet records the
data: a successful “no” to the solicitation. 

Janet listens to another call, and although she is supposed to be blind to which 
group the person was in, she recognizes the voice as being one of the role-playing volunteers.
Brad pitches the magazine to her and her response is “Oh, I already get that
magazine. But if you are selling any others, I might be able to buy one.” Janet records
the data: a successful “no” to the question of whether she wants to buy that magazine. 

The second phase is finally over and Janet has a list of the results. But now she no- 
tices a problem. There are 26 people listed in the control group and 24 listed in the role- 
playing group. She tries to resolve the discrepancy but can’t figure it out. She notices 
that the last two people on the list are both in the control group and that they both said 
“no” to the solicitation. She figures she will just randomly choose one of them to move
over to the role-playing group, so she flips a coin to decide which one to move.


This example illustrates just a few of the many ways in which experimenter bias 
can enter into a research study. Even when it appears that protocols are carefully in 
place, in the real world it is nearly impossible to place controls on all aspects of the 
research. In this case, notice that every decision Janet made benefited her desired con- 
clusion that the role-playing group would learn to say “no.” It is up to the researchers 
to take utmost care not to allow these kinds of unethical influences on the results. 
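
One of the precautions listed earlier, randomization by a well-tested computer device, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name assign_groups, the seed value, and the volunteer numbering are hypothetical. The key design choice is that the meaning of each group label is written down before any volunteer arrives, which is exactly the step Janet skipped.

```python
import random

# Group meanings are fixed BEFORE anyone is assigned (hypothetical labels).
GROUP_LABELS = {1: "role-playing", 2: "control"}


def assign_groups(n_volunteers, seed):
    """Randomly split volunteers 1..n into two equal-sized groups.

    A fixed seed makes the list reproducible, so a third party can
    regenerate and audit it later.
    """
    if n_volunteers % 2 != 0:
        raise ValueError("need an even number of volunteers for equal groups")
    rng = random.Random(seed)
    # Half the slots are Group 1, half are Group 2; shuffle the slots.
    slots = [1] * (n_volunteers // 2) + [2] * (n_volunteers // 2)
    rng.shuffle(slots)
    return {volunteer: group for volunteer, group in enumerate(slots, start=1)}


assignment = assign_groups(50, seed=2024)  # seed value is arbitrary
counts = {g: list(assignment.values()).count(g) for g in GROUP_LABELS}
print(counts)  # each group has exactly 25 volunteers
for volunteer in range(1, 4):
    print(volunteer, GROUP_LABELS[assignment[volunteer]])
```

Because the group sizes are fixed at 25 each, a final tally of 26 and 24 like Janet's would immediately signal a recording error rather than something to be patched with a coin flip.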


26.3 Appropriate Statistical Analyses 


There are a number of decisions that need to be made when analyzing the results of
a study, and care must be taken not to allow biases to affect those decisions. Here are
some examples of decisions that can influence the results:


■ Should the alternative hypothesis be one-sided or two-sided? This decision must
be made before examining the data.

■ What level of confidence or level of significance should be used?

■ How should outliers be handled?

■ If more than one statistical method is available, which one is most appropriate?

■ Have all appropriate conditions and assumptions been investigated and verified?


One of the easiest ethical blunders to make is to try different methods of analysis
until one produces the desired results. Ideally, the planned analysis should be
spelled out in advance. If various analysis methods are attempted, all analyses should
be reported along with the results.


EXAMPLE 3
Jake’s (Hypothetical) Fishing Expedition

Jake was doing a project for his statistics class and he knew it was important to find a 
statistically significant result because all of the interesting examples in class had statisti- 
cally significant results. He decided to compare the memory skills of males and females, 
but he did not have a preconceived idea of who should do better, so he planned to do 
a two-sided test. He constructed a memory test in which he presented people with a list 
of 100 words and allowed them to study it for 10 minutes. Then the next day, he pre- 
sented another list of 100 words to the participants and asked them to indicate for each 
word whether it had been on the list the day before. The answers were entered into a 
bubble sheet and scored by computer so that Jake didn’t inadvertently misreport any 
scores. The data for number of correct answers were as follows: 


Males: 69, 70, 71, 61, 73, 68, 70, 69, 67, 72, 64, 72, 65, 70, 100 

Females: 64, 74, 72, 76, 64, 72, 76, 80, 72, 73, 71, 70, 64, 76, 70 

Jake remembered that they had been taught two different tests for comparing independent
samples, called a two-sample t-test to compare means and a Mann-Whitney
rank-sum test to compare medians. He decided to try them both to see which one gave
better results. He found the following computer results:

Two-Sample T-Test and CI: Males, Females

Two-sample T for Males vs Females

           N   Mean  StDev  SE Mean
Males     15  70.73   8.73      2.3
Females   15  71.60   4.75      1.2

Difference = mu Males - mu Females
Estimate for difference: -0.87
95% CI for difference: (-6.20, 4.47)
T-Test of difference = 0 (vs not =): T-Value = -0.34  P-Value = 0.739  DF = 21


Mann-Whitney Test and CI: Males, Females

Males    N = 15  Median = 70.00
Females  N = 15  Median = 72.00
Point estimate for ETA1-ETA2 is -3.00
95.4 Percent CI for ETA1-ETA2 is (-6.00,1.00)
W = 193.5
Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at 0.1103
The test is significant at 0.1080 (adjusted for ties)
Cannot reject at alpha = 0.05


Jake was disappointed that neither test gave a small enough p-value to reject the 
null hypothesis. He was also surprised by how different the results were. The t-test pro- 
duced a p-value of 0.739, whereas the Mann-Whitney test produced a p-value of 0.108, 
adjusted for ties. It dawned on him that maybe he should conduct a one-tailed test in- 
stead. After all, it was clear that the mean and median for the females were both higher, 
so he decided that the alternative hypothesis should be that females would do better. 
He reran both tests. This time, the p-value for the t-test was 0.369 and for the 
Mann-Whitney test it was 0.054. Maybe he should simply round that one off to 0.05 
and be done. 

But Jake began to wonder why the tests produced such different results. He looked 
at the data and realized that there was a large outlier in the data for the males. Some- 
one scored 100%! Jake thought that must be impossible. He knew that he shouldn't 
just remove an outlier, so he decided to replace it with the median for the males, 70. 
Just to be sure he was being fair, he reran the original two-sided hypothesis tests. This 
time, the p-value for the t-test was 0.066 and the p-value for the Mann-Whitney test 
(adjusted for ties) was 0.0385. Finally! Jake wrote his analysis explaining that he did a 
two-sided test because he didn’t have a preconceived idea of whether males or females 
should do better. He said that he decided to do the Mann-Whitney test because he had 
small samples and he knew that the t-test wasn’t always appropriate with sample sizes 
less than 30. He didn’t mention replacing the outlier because he didn’t think it was a le- 
gitimate value anyway. 

Although this hypothetical situation is an exaggeration to make a point, hopefully it 
illustrates the dangers of “data-snooping.” If you manipulate the data and try enough 
different procedures, something will eventually produce desired results. It is not ethical 
to keep trying different methods of analysis until one produces a desired result. 
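To see how much "trying enough different procedures" can matter, here is a small simulation sketch. It is not Jake's actual analysis: the data are simulated scores with no real difference between the groups, the test is a large-sample z approximation rather than the t-test or Mann-Whitney test, and the function names, sample size of 15 per group, and score distribution are all assumptions made for illustration. It compares how often a single planned two-sided test reaches p < 0.05 with how often the best of three analyses does (the planned test, a one-sided test whose direction is chosen after looking at the data, and a retest after replacing the most extreme value in each group with the median).

```python
import random
from statistics import NormalDist, mean, median, variance

def z_test_p(a, b):
    """Two-sided p-value for a difference in means (large-sample z approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def replace_extreme_with_median(xs):
    """Copy of xs with the value farthest from the median replaced by the median."""
    m = median(xs)
    i = max(range(len(xs)), key=lambda j: abs(xs[j] - m))
    return xs[:i] + [m] + xs[i + 1:]

def simulate(n_sims=2000, n=15, seed=1):
    rng = random.Random(seed)
    honest = snooped = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution, so the null hypothesis is true.
        a = [rng.gauss(70, 10) for _ in range(n)]
        b = [rng.gauss(70, 10) for _ in range(n)]
        p_two_sided = z_test_p(a, b)
        p_one_sided = p_two_sided / 2  # direction picked after seeing the data
        p_cleaned = z_test_p(replace_extreme_with_median(a),
                             replace_extreme_with_median(b))
        honest += p_two_sided < 0.05
        snooped += min(p_two_sided, p_one_sided, p_cleaned) < 0.05
    return honest / n_sims, snooped / n_sims

honest_rate, snooped_rate = simulate()
print(f"planned two-sided test only: {honest_rate:.3f}")
print(f"best of three analyses:      {snooped_rate:.3f}")
```

Because the smallest of several p-values can never be larger than the planned one, the second rate is always at least as large as the first, even though nothing real is going on in the simulated data.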


As a final example, read Example 3 in Chapter 20, “The Debate over Passive 
Smoking.” The issue in that example was whether it was ethical to report a 90% 
confidence interval instead of using the standard 95% confidence level. 
What do you think? Should the EPA have used the common standard and reported 
a 95% confidence interval? If they had done so, the interval would have included 
values indicating that the risk of lung cancer for those who are exposed to 
passive smoke may actually be lower than the risk for those who are not. The 90% 
confidence interval reported was narrower and did not include those values. 


26.4 Fair Reporting of Results 


Research results are usually reported in articles published in professional journals. 
Most researchers are careful to report the details of their research, but there are more 
subtle issues that researchers and journalists sometimes ignore. There are also more 
blatant reporting biases that can mislead readers, usually in the direction of making 
stronger conclusions than are appropriate. We have discussed some of the problems 
with interpreting statistical inference results in earlier chapters, and some of those 
problems are related to how results are reported. Let’s revisit some of those, and dis- 
cuss some other possible ethical issues that can arise when reporting the results of 
research. 


Sample Size and Statistical Significance 


Remember that whether a study achieves statistical significance depends not only on 
the magnitude of whatever effect or relationship may actually exist but also on the 
size of the study. In particular, if a study fails to find a statistically significant result, 
it is important to include a discussion of the sample size used and the power to de- 
tect an effect that would result from that sample size. Often, research will be reported 
as having found no effect or no difference, when in fact the study had such low 
power that even if an effect exists the study would have been unlikely to detect it. 

One of the ethical responsibilities of a researcher is to collect enough data to 
have relatively high probability of finding an effect if indeed it really does exist. In 
other words, it is the responsibility of the researcher to determine an appropriate 
sample size in advance, as well as to discuss the issue of power when the results are 
presented. A study that is too small to have adequate power is a waste of everyone’s 
time and money. 
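The link between sample size and power can be sketched with a normal approximation. The function below is an illustration under stated assumptions, not a formula from this book: it assumes a two-sided comparison of two means with equal group sizes and a known common standard deviation, and the name `approx_power` and the effect size of 0.5 standard deviations are hypothetical choices.

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the true difference in means divided by the common
    standard deviation; both groups are assumed to have n_per_group members.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (10, 30, 64, 100):
    print(n, round(approx_power(0.5, n), 2))
```

With only 10 participants per group, a real effect of this size would usually go undetected, which is why a report of "no difference" should always be read alongside the sample size.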

As we have discussed earlier in the book, the other side of the problem is the 
recognition that statistical significance does not necessarily imply practical impor- 
tance. If possible, researchers should report the magnitude of an effect or difference 
through the use of a confidence interval, rather than just reporting a p-value. 


Multiple Hypothesis Tests and Selective Reporting 


In most research studies there are multiple outcomes measured. Many hypothesis 
tests may be done, looking for whatever statistically significant relationships may 
exist. For example, a study may ask people to record their dietary intake over a long 
time period so that the investigators can then look for foods that are correlated with 
certain health outcomes. The problem is that even if there are no legitimate relationships 
in the population, something in the sample may be statistically significant just 
by chance. It is unethical to report conclusions only about the results that were sta- 
tistically significant, without informing the reader about all of the tests that were 
done. 

The Ethical Guidelines of the American Statistical Association (1999) lists the 
problem under “Professionalism” as follows: 


Recognize that any frequentist statistical test has a random chance of indicat- 
ing significance when it is not really present. Running multiple tests on the 
same data set at the same stage of an analysis increases the chance of obtain- 
ing at least one invalid result. Selecting the one “significant” result from a 
multiplicity of parallel tests poses a grave risk of an incorrect conclusion. Fail- 
ure to disclose the full extent of tests and their results in such a case would be 
highly misleading. 


In many cases it is not the researcher who makes this mistake but the media. 
The media is naturally interested in surprising or important conclusions, not in re- 
sults showing that there is nothing going on. For instance, the story of interest will 
be that a particular food is related to higher cancer incidence, not that 30 other foods 
did not show that relationship. It is unethical for the media to publicize such results 
without explaining the possibility that multiple testing may be responsible for un- 
covering a spurious chance relationship. 
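A rough calculation shows how quickly the chance of a spurious finding grows. The sketch below assumes that every null hypothesis is true and that the tests are independent, which real dietary data would only approximate; the function name is hypothetical, and 31 tests corresponds to one reported food plus 30 others.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """P(at least one test is 'significant') when every null hypothesis is true,
    assuming the tests are independent."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 10, 31):
    print(f"{k:>2} tests: {chance_of_false_positive(k):.2f}")
```

Under these assumptions, a study that tests 31 foods is more likely than not to find at least one "significant" food even when no food is related to the disease.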

Even if there is fairly strong evidence that observed statistically significant rela- 
tionships represent real relationships in the population, the media should mention 
other, less interesting results because they may be important for people making 
lifestyle decisions. For example, if the relationships between certain foods and disease 
are explored, it is interesting to know which foods do not appear to be related 
to the disease as well as those that appear to be related. 


EXAMPLE 4 


Helpful and Harmful Outcomes from 
Hormone Replacement Therapy 


In July 2002 the results were released from a large clinical trial studying the effects of 
estrogen plus progestin hormone replacement for postmenopausal women. The trial 
was stopped early because of increased risk of breast cancer and coronary heart disease 
among the women taking the hormones. However, many news stories failed to report 
some of the other results of the study, which showed that the hormones actually de- 
creased the risk of other adverse outcomes and were unresolved about others. The orig- 
inal article (Writing Group for the Women’s Health Initiative Investigators, 2002) 
reported the results as follows: 


Absolute excess risks per 10,000 person-years attributable to estrogen plus pro- 
gestin were 7 more CHD [coronary heart disease] events, 8 more strokes, 8 more 
PEs [pulmonary embolism], 8 more invasive breast cancers, while absolute risk re- 
ductions per 10,000 person-years were 6 fewer colorectal cancers and 5 fewer hip 
fractures. 


These results show that in fact some outcomes were more favorable for those taking the 
hormones, specifically colorectal cancer and hip fractures. Because different people are 
at varying risk for certain diseases, it is important to report all of these outcomes so that 
an individual can make an informed choice about whether to take the hormones. In fact, 
overall, 231 out of 8506 women taking the hormones died of any cause during the 
study, which is 2.72%. Of the 8102 women taking the placebo, 218 or 2.69% died, a 
virtually identical result with the hormone group. In fact, when the results are adjusted 
for the time spent in the study, the death rate was slightly lower in the hormone group, 
with an annualized rate of 0.52% compared with 0.53% in the placebo group. 

The purpose of this example is not to negate the serious and unexpected outcomes 
related to heart disease, which hormones were thought to protect against, or the seri- 
ous breast cancer outcome. Instead, the purpose is to show that the results of a large 
and complex study such as this one are bound to be mixed and should be presented as 
such so that readers can make informed decisions. 
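The overall death rates quoted in this example can be checked directly from the counts given. The variable names below are my own; the numbers are the ones reported above.

```python
hormone_deaths, hormone_n = 231, 8506
placebo_deaths, placebo_n = 218, 8102

# Percentage of each group that died of any cause during the study.
hormone_rate = 100 * hormone_deaths / hormone_n
placebo_rate = 100 * placebo_deaths / placebo_n

print(f"hormone group: {hormone_rate:.2f}% died")
print(f"placebo group: {placebo_rate:.2f}% died")
print(f"difference:    {hormone_rate - placebo_rate:.2f} percentage points")
```

These are unadjusted percentages; they do not account for time spent in the study, which is what the annualized rates of 0.52% and 0.53% do.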


Making Stronger or Weaker Conclusions 
than Are Justified 


As we have learned throughout this book, for many reasons research studies cannot 
be expected to measure all possible influences on a particular outcome. They are also 
likely to have problems with ecological validity, in which the fact that someone is 
participating in a study is enough to change behavior or produce a result that would 
not naturally occur. It is important that research results be presented with these is- 
sues in mind and that the case is not overstated even when an effect or relationship 
is found. An obvious example, often discussed in this book, is that a cause-and- 
effect conclusion cannot generally be made on the basis of an observational study. 

However, there are more subtle ways in which conclusions may be made that are 
stronger or weaker than is justified. For example, often little attention is paid to how 
representative the sample is of a larger population, and results are presented as if 
they would apply to all men, or all women, or all adults. It is important to consider 
and report an accurate assessment of who the participants in the study are really 
likely to represent. 

Sometimes there are financial or political pressures that can lead to stronger (or 
weaker) conclusions than are justified. Results of certain studies may be suppressed 
while others are published, if some studies support the desired outcome and others 
don’t. As cautioned throughout this book, when you read the results of a study that 
has personal importance to you, try to gain access to as much information as possi- 
ble about what was done, who was included, and all of the analyses and results. 


CASE STUDY 26.1 


Science Fair Project or Fair Science Project? 


In 1998 a fourth-grade girl’s science project received extensive media coverage af- 
ter it led to publication in the Journal of the American Medical Association (Rosa et 
al., 1998). Later that year, in its December 9, 1998 issue, the journal published a se- 
ries of letters criticizing the study and its conclusions on a wide variety of issues. 
There are a number of ethical issues related to this study, some of which were raised 
by the letters and others that have not been raised before. 

The study was supposed to be examining “therapeutic touch” (TT), a procedure 
practiced by many nurses that involves working with patients through a five-step 
process, including a sensing and balancing of their “energy.” The experiment pro- 
ceeded as follows. Twenty-one self-described therapeutic touch practitioners partic- 
ipated. They were asked to sit behind a cardboard screen and place their hands 
through cutout holes, resting them on a table on the other side of the screen. The 
9-year-old “experimenter” then flipped a coin and used the outcome to decide which 
of the practitioner’s hands to hold her hand over. The practitioner was to guess which 
hand the girl was hovering over. Fourteen of the practitioners contributed 10 tries 
each, and the remaining 7 contributed 20 tries each, for a total of 280 tries. 

The paper had four authors, including the child and her mother. It is clear from 
the affiliations of the authors of the paper as well as from language throughout the 
paper that the authors were biased against therapeutic touch before the experiment 
began. For example, the first line read “therapeutic touch (TT) is a widely used nurs- 
ing practice rooted in mysticism but alleged to have a scientific basis” (Rosa et al., 
1998, p. 1005). The first author of the paper was the child’s mother and her affilia- 
tion is listed as “the Questionable Nurse Practices Task Force, National Council 
Against Health Fraud, Inc.” 

The paper concludes that “twenty-one experienced TT practitioners were unable 
to detect the investigator’s ‘energy field.’ Their failure to substantiate TT’s most fun- 
damental claim is unrefuted evidence that the claims of TT are groundless and that 
further professional use is unjustified.” The conclusion was widely reported in the 
media, presumably at least in part because of the novelty of a child having done the 
research. That would have been cute if it hadn’t been taken so seriously by people 
on both sides of the debate on the validity of therapeutic touch. 

The letters responding to the study point out many problematic issues with how 
the study was done and with its conclusions. Here are several quotes: 


The experiments described are an artificial demonstration that some number of 
self-described mystics were unable to “sense the field” of the primary investi- 
gator’s 9-year-old daughter. This hardly demonstrates or debunks the efficacy 
of TT. The vaguely described recruitment method does not ensure or even sug- 
gest that the subjects being tested were actually skilled practitioners. More im- 
portant, the experiments described are not relevant to the clinical issue 
supposedly being researched. Therapeutic touch is not a parlor trick and 
should not be investigated as such. (Freinkel, 1998) 


To describe this child’s homework as “research” is without foundation since it 
clearly fails to meet the criteria of randomization, control, and valid interven- 
tion. ... Flagrant violations against TT include the fact that “sensing” an en- 
ergy field is not TT but rather a nonessential element in the 5-step process; 
inclusion of many misrepresentations of cited sources; use of inflammatory lan- 
guage that indicates significant author bias; and bias introduced by the child 
conducting the project being involved in the actual trials. (Carpenter et al., 
1998) 


I critiqued the study on TT and was amazed that a research study with so 
many flaws could be published. . . . The procedure was conducted in different 
settings with no control of environmental conditions. Even though the trials 
were repeated, the subjects did not change, thus claims of power based on 
possible repetitions of error are inappropriate. The true numbers in groups are 
15 and 13, thus making a type II error highly probable with a study power of 
less than 30%. 

Another concern is whether participants signed informed consent docu- 
ments or at least were truly informed as to the nature of this study and that 
publication of its results would be sought beyond a report to the fourth-grade 
teacher. (Schmidt, 1998) 


As can be seen by these reader comments, many of the ethical issues covered in 
this chapter may cloud the results of this study. However, there are two additional 
points that were not raised in any of the published letters. First, it is very likely that 
the child knew that her mother wanted the results to show that the participants would 
not be able to detect which hand was being hovered over. And what child does not 
want to please her mother? That would not matter so much if there hadn’t been so 
much room for “experimenter effects” to influence the results. One example is the 
randomization procedure. The young girl flipped a coin each time to determine 
which hand to hover over. Coin tosses are very easy to influence, and presumably 
even a 9-year-old child could pick up the response biases of the subjects. The fact 
that a proper randomization method wasn’t used should have ended any chance that 
this experiment would be taken seriously. 

Is there evidence that experimenter bias may have entered the experiment? Ab- 
solutely. Of the 280 tries, the correct hand was identified in 123 (44%) of them. The 
authors of the article conclude that this number is “close to what would be expected 
for random chance.” In fact, that is not the case. The probability of getting 123 or 
fewer correct guesses by chance alone is only 0.0242. If a two-tailed test had been used 
instead of a one-tailed test, the p-value would have been 0.048, a statistically significant outcome. 
The 9-year-old did an excellent job of fulfilling her mother’s expectations. 
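The probability quoted above can be checked with an exact binomial calculation. This is a sketch under the assumption that the 280 tries are independent and that each guess has a 50% chance of being correct; the variable names are my own. The two-tailed value is found by doubling the lower tail, which is valid here because the binomial distribution is symmetric when the success probability is one-half.

```python
from math import comb

n_tries, n_correct = 280, 123

# Exact binomial probability of n_correct or fewer successes in n_tries
# when each guess is correct with probability 1/2.
lower_tail = sum(comb(n_tries, k) for k in range(n_correct + 1)) / 2 ** n_tries

print(f"proportion correct: {n_correct / n_tries:.2f}")
print(f"P(X <= {n_correct}):       {lower_tail:.4f}")
print(f"two-tailed p-value: {2 * lower_tail:.4f}")
```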


Exercises 


Asterisked (*) exercises are included in the Solutions at the back of the book. 


1. Visit the Web sites of one or more professional organizations related to your ma- 
jor. (You may need to ask one of your professors to help with this.) Find one that 
has a code of ethics. Describe whether the code includes anything about re- 
search methods. If not, explain why you think nothing is included. If so, briefly 
summarize what is included. Include the Web site address with your answer. 


2. A classic psychology experiment was conducted by psychology professor Philip 
Zimbardo in the summer of 1971 at Stanford University. The experiment is de- 
scribed at the Web site http://www.prisonexp.org/. Visit the Web site; if it is no 
longer operative, try an Internet search on “Zimbardo” and “prison” to find in- 
formation about this study. 


a. Briefly describe the study and its findings. 

b. The Web site has a number of discussion questions related to the study. Here 
is one of them: “Was it ethical to do this study? Was it right to trade the suf- 
fering experienced by participants for the knowledge gained by the re- 
search?” Discuss these questions. 


c. Another question asked at the Web site is, “How do the ethical dilemmas in 
this research compare with the ethical issues raised by Stanley Milgram’s 
obedience experiments? Would it be better if these studies had never been 
done?” Discuss these questions. 


3. In the report Principles and Practices for a Federal Statistical Agency (Martin, 
Straf, and Citro, 2001), one of the recommended practices is “a strong position 
of independence.” Each of the following parts gives one of the characteristics 
that is recommended to help accomplish this. In each case, explain how the rec- 
ommendation would help ensure a position of independence for the agency. 


a. “Authority for selection and promotion of professional, technical, and oper- 
ational staff” (p. 6). 


b. “Authority for statistical agency heads and qualified staff to speak about the 
agency’s statistics before Congress, with congressional staff, and before pub- 
lic bodies” (p. 6). 


*4. Refer to Example 2 about Janet’s dissertation research. Explain whether each of 
the following changes would have been a good idea or a bad idea: 


*a. Have someone who is blind to conditions and has not heard the volunteers’ 
voices before listen to the phone calls to assess whether a firm “no” was 
given. 

b. Have Janet flip a coin each time a volunteer comes to her office to decide if 
they should go into the role-playing group or the control group. 


c. Use a variety of different solicitors rather than Brad alone. 


5. Refer to Example 3 about Jake’s memory experiment. What do you think Jake 
should have done regarding the analysis and reporting of it? 


6. Find an example of a statistical study reported in the news. Explain whether you 
think multiple tests were done and, if so, whether they were reported. 


7. Marilyn is a statistician who works for a company that manufactures components 
for sound systems. Two development teams have each come up with a new 
method for producing one of the components, and management is to decide 
which one to adopt based on which produces components that last longer. Mar- 
ilyn is given the data for both and is asked to make a recommendation about 
which one should be used. She computes 95% confidence intervals for the mean 
lifetime using each method and finds an interval from 96 hours to 104 hours, 
centered on 100 hours for one of them, and an interval from 92 hours to 112 
hours, centered on 102 for the second one. What should she do? Should she 
make a clear recommendation? Explain. 


Use the following scenario for Exercises 8 to 12: Based on the information in Case 
Study 26.1, describe the extent to which each of the following conditions for reduc- 
ing the experimenter effect (listed in Section 26.2) was met. If you cannot tell from 
the description of the experiment, then explain what additional information you 
would need. 


8. Randomization done by a third party with no vested interest in the experiment, 
or at least done by a well-tested computer randomization device. 


9. Automated data recording without intervention from the researcher. 


*10. Double-blind procedures to ensure that no one who has contact with the partic- 
ipants knows which treatment or condition they are receiving. 


11. An honest evaluation that what is being measured is appropriate and unbiased 
for the research question of interest. 


12. A standard protocol for the treatment of all participants that must be strictly 
followed. 


Exercises 13 to 22: Explain the main ethical issue of concern in each of the follow- 
ing. Discuss what, if anything, should have been done differently to address that 
concern. 


13. Example 1, Stanley Milgram’s experiment. 


14. In Example 2, Janet’s decision to make Group 2 the control group after the first 
volunteer came to her office. 


15. In Example 2, Janet’s handling of the fact that her data showed 26 volunteers in 
the control group when there should have been only 25. 


16. In Example 3, Jake’s decision to replace the outlier with the median of 70. 


*17. In Example 3 in Chapter 20, the Environmental Protection Agency’s decision to 
report a 90% confidence interval. 


18. In Example 4, the fact that many media stories mentioned increased risk of 
breast cancer and coronary heart disease but not any of the other results. 


*19. In Case Study 26.1, the concerns raised in the letter from Freinkel. 
20. In Case Study 26.1, the concerns raised in the letter from Carpenter et al. 
21. In Case Study 26.1, the concerns raised in the letter from Schmidt. 


22. In Case Study 26.1, the concerns raised about the experimenter effect. 
Putting What You Have 
Learned to the Test 


This chapter consists solely of case studies. Each case study begins with a full or par- 
tial article from a newspaper or journal. Read each article and think about what 
might be misleading or missing, or simply misinformation. A discussion following 
each article summarizes some of the points I thought should be raised. You may be 
able to think of others. I hope that as you read these case studies you will realize that 
you have indeed become an educated consumer of statistical information. 


CASE STUDY 27.1 
Cranberry Juice and Bladder Infections 


Source: “Juice does prevent infection” (9 March 1994), Davis (CA) Enterprise, p. A9. 
Reprinted with permission. 


CHICAGO (AP)—A scientific study has proven what many women have long suspected: 
Cranberry juice helps protect against bladder infections. 

Researchers found that elderly women who drank 10 ounces of a juice drink con- 
taining cranberry juice each day had less than half as many urinary tract infections 
as those who consumed a look-alike drink without cranberry juice. 

The study, which appeared today in the Journal of the American Medical Asso- 
ciation, was funded by Ocean Spray Cranberries, Inc., but the company had no role 
in the study’s design, analysis or interpretation, JAMA said. 

“This is the first demonstration that cranberry juice can reduce the presence of 
bacteria in the urine in humans,” said lead researcher Dr. Jerry Avorn, a specialist in 
medication for the elderly at Harvard Medical School. 


Discussion 


This study was well conducted, and the newspaper report is very good, including in- 
formation such as the source of the funding and the fact that there was a placebo (“a 
look-alike drink without cranberry juice”). A few details are missing from the news 
account, however, that you may consider to be important. They are contained in the 
original report (Avorn et al., 1994): 


1. There were 153 subjects. 
2. The participants were randomly assigned to the cranberry or placebo group. 


3. The placebo was fully described as “a specially prepared synthetic placebo drink 
that was indistinguishable in taste, appearance, and vitamin C content, but lacked 
cranberry content.” 


4. The study was conducted over a 6-month period, with urine samples taken 
monthly. 


5. The measurements taken were actually bacteria counts from urine, rather than a 
more subjective assessment of whether an infection was present. The news story 
claims “less than half as many urinary tract infections,” but the original report 
gave the odds of bacteria levels exceeding a certain threshold. The odds in the 
cranberry juice group were only 42% of what they were in the control group. Un- 
reported in the news story is that the odds of remaining over the threshold from 
one month to the next for the cranberry juice group were only 27% of what they 
were for the control group. 

6. The participants were elderly women volunteers with a mean age of 78.5 years 
and high levels of bacteria in their urine at the start of the study. 

7. The original theory was that if cranberry juice was effective, it would work by in- 
creasing urine acidity. That was not the case, however. The juice inhibited bacte- 
rial growth in some other way. 
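Point 5 matters because an odds ratio and a ratio of risks are not the same thing. The sketch below shows how an odds ratio of 0.42 translates into a relative risk for several baseline risks. The baseline risks are hypothetical values chosen for illustration, not figures from the study, and the function name is my own.

```python
def relative_risk_from_odds_ratio(baseline_risk, odds_ratio):
    """Convert an odds ratio into a relative risk for a given baseline risk."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    treated_odds = odds_ratio * baseline_odds
    treated_risk = treated_odds / (1 + treated_odds)
    return treated_risk / baseline_risk

for baseline in (0.01, 0.10, 0.30, 0.50):  # hypothetical control-group risks
    rr = relative_risk_from_odds_ratio(baseline, 0.42)
    print(f"baseline risk {baseline:.2f}: relative risk {rr:.2f}")
```

When the outcome is rare, the two ratios nearly agree. When the outcome is common, as it may be in a group that started with high bacteria levels, odds that are 42% as large correspond to a risk that is more than half as large, so "less than half as many" overstates the reduction.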
CASE STUDY 27.2 


Children on the Go 


Source: Coleman, Brenda C. (15 September 1993), Children who move often are problem 
prone, West Hawaii Today. Reprinted with permission. 


CHICAGO—Children who move often are 35 percent more likely to fail a grade and 
77 percent more likely to have behavioral problems than children whose families 
move rarely, researchers say. 

A nationwide study of 9915 youngsters ages 6 to 17 measured the harmful ef- 
fects of moving. The findings were published in today’s issue of the Journal of the 
American Medical Association. 

About 19 percent of Americans move every year, said the authors, led by Dr. 
David Wood of Cedars-Sinai Medical Center in Los Angeles. The authors cited a 
1986-87 Census Bureau study. 

The authors said that our culture glorifies the idea of moving through maxims 
such as “Go West, young man.” 

Yet moving has its “shadow side” in the United States, where poor and minority 
families have been driven from place to place by economic deprivation, eviction and 
racism, the researchers wrote. 

Poor families move 50 percent to 100 percent more often than wealthier fami- 
lies, they said, citing the Census Bureau data. 

The authors used the 1988 National Health Interview Survey and found that 
about one-quarter of children had never moved, about half had moved fewer than 
three times and about three-quarters fewer than four times. 

Ten percent had moved at least six times, and the researchers designated them 
“high movers.” 

Compared with the others, the high movers were 1.35 times more likely to have 
failed a grade and 1.77 times more likely to have developed at least four frequent be- 
havioral problems, the researchers said. Behavioral problems ranged from depres- 
sion to impulsiveness to destructiveness. 

Frequent moving had no apparent effect on development and didn’t appear to 
cause learning disabilities, they found. 

The researchers said they believe their study is the first to measure the effects of 
frequent relocation on children independent of other factors that can affect school 
failure and behavioral problems. 

Those factors include poverty, single parenting, belonging to a racial minority 
and having parents with less than a high school education. 

Children in families with some or all of those traits who moved often were much 
more likely to have failed a grade—1.8 to 6 times more likely—than children of 
families with none of those traits who seldom or never moved. 

The frequently relocated children in the rougher family situations also were 1.8 
to 3.6 times more likely to have behavioral problems than youngsters who stayed put 
and lived in more favorable family situations. 

“A family move disrupts the routines, relationships and attachments that define 
the child’s world,” researchers said. “Almost everything outside the family that is fa- 
miliar is lost and changes.” 
Dr. Michael Jellinek, chief of child psychiatry at Massachusetts General Hospital, 
said he couldn’t evaluate whether the study accurately singled out the effect of 
moving. 


Discussion 


There are some problems with this study and with the reporting of it in the newspa- 
per. First, it is not until well into the news article that we finally learn that the re- 
ported figures have already been adjusted for confounding factors, and even then the 
news article is not clear about this. In fact, the 1.35 relative risk of failing a grade 
and the 1.77 relative risk of at least four frequent behavioral problems apply after 
adjusting for poverty, single parenting, belonging to a racial minority, and having 
parents with less than a high school education. 

The news report is also missing information about baseline rates. From the orig- 
inal report (Wood et al., 1993), we learn that 23% of children who move frequently 
have repeated a grade, whereas only 12% of those who never or infrequently move 
have repeated a grade. 
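The baseline rates make it possible to compute the unadjusted comparison and set it beside the adjusted figure in the news story. The short calculation below uses only the two percentages just quoted; the variable names are my own.

```python
risk_frequent_movers = 0.23   # repeated a grade, children who move frequently
risk_other_children = 0.12    # repeated a grade, children who move rarely or never

unadjusted_relative_risk = risk_frequent_movers / risk_other_children
risk_difference = risk_frequent_movers - risk_other_children

print(f"unadjusted relative risk: {unadjusted_relative_risk:.2f}")
print(f"risk difference: {100 * risk_difference:.0f} percentage points")
print("adjusted relative risk reported in the article: 1.35")
```

The gap between the unadjusted ratio and the adjusted 1.35 suggests that a good part of the raw difference is associated with the family characteristics that were adjusted for, not with moving itself.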

The news report also fails to mention that the data were based on surveys with 
parents. The results could therefore be biased by the fact that some parents may not 
have been willing to admit that their children were having problems. 

Although the news report implies that moving is the cause of the increase in 
problems, this is an observational study and causation cannot be established. Nu- 
merous other confounding factors that were not controlled for could account for the 
results. Examples are number of days missed at school, age of the parents, and qual- 
ity of the schools attended. 
CASE STUDY 27.3 


You Can Work and Get Your Exercise at the Same Time 
Source: White collar commute (11 March 1994), Davis (CA) Enterprise, p. CS. 


One in five clerical workers walks about a quarter mile a day just to complete rou- 
tine functions like faxing, copying and filing, a national survey on office efficiency 
reports. The survey also shows that the average office worker spends close to 15 
percent of the day just walking around the office. Not surprisingly, the survey was 
commissioned by Canon U.S.A., maker of—you guessed it—office copiers and 
printers that claim to cut the time and money “spent running from one machine to 
the next.” 


Discussion 


We are not given any information that would allow us to evaluate the results of this 
survey. Further, what does it mean to say that one in five workers walk that far? Do 
the others walk more or less? Is the figure based on an average of the top 20% (one 
in five) of a set of numbers, in which case outliers for office delivery personnel and 
others are likely to distort the mean? 
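To illustrate the last question, here is a made-up set of daily walking distances. These numbers are entirely hypothetical and are not from the survey; they only show how a few extreme values can pull a mean far from what is typical.

```python
from statistics import mean, median

# Hypothetical daily walking distances in miles for ten office workers.
# The last two values stand in for delivery staff who walk far more than the rest.
miles = [0.05, 0.06, 0.08, 0.10, 0.10, 0.12, 0.15, 0.20, 2.50, 3.00]

k = len(miles) // 5                 # size of the top 20% ("one in five")
top_fifth = sorted(miles)[-k:]

print(f"mean for everyone:   {mean(miles):.2f} miles")
print(f"median for everyone: {median(miles):.2f} miles")
print(f"mean of the top 20%: {mean(top_fifth):.2f} miles")
```

In this invented example the mean is several times the median, and the "one in five" figure describes only the two unusual workers.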


CASE STUDY 27.4 


Sex, Alcohol, and the First Date 


Source: “Teen alcohol-sex survey finds the unexpected” (11 May 1994), Sacramento Bee, 
p. A11. Reprinted with permission. 


WASHINGTON— Young couples are much more likely to have sex on their first date if 
the male partner drinks alcohol and the woman doesn’t, new research shows. 

The research, to be presented today at an American Psychological Association 
conference on women’s health, contradicts the popular male notion that plying a 
woman with alcohol is the quickest path to sexual intercourse. 

In fact, interviews with 2052 teenagers found that they reported having sex on a 
first date only 6 percent of the time if the female drank alcohol while the male did 
not. That was lower than the 8 percent who reported having sex when neither part- 
ner drank. 

Nineteen percent of the teens reported having sex when both partners drank, but 
the highest frequency of sex on the first date—24 percent—was reported when only 
the male drank. 
The lead researcher in the study, Dr. M. Lynne Cooper of the State University of
New York at Buffalo, said that drinking may increase a man’s willingness “to self- 
disclose things about himself, be more likely to communicate feelings, be more ro- 
mantic—and the female responds to that.” 

Rather than impairing a woman’s judgment, alcohol apparently makes many 
women more cautious, Cooper said. “Women may sort of smell danger in alcohol, 
and it may trigger some warning signs,” she said. “It makes a lot of women more 
anxious.” 


Discussion 


This is an excellent example of a misinterpreted observational study. The authors of 
both the original study and the news report are making the assumption that drinking 
behavior influences sexual behavior. Because the drinking behavior was clearly not 
randomly assigned, there is simply no justification for such an assumption. 

Perhaps the causal connection is in the reverse direction. The drinking scenario 
that was most frequently associated with sex on the first date was when the male 
drank but the female did not. If a couple suspected that the date would lead to sex, 
perhaps they would plan an activity in which they both had access to alcohol, but the 
female decided to keep her wits. 

The simplest explanation is that the teenagers did not tell the truth about their 
own behavior. It would be less socially acceptable for a female to admit that she 
drank alcohol and then had sex on a first date than it would be for a male. There- 
fore, it could be that males and females both exaggerated their behavior, but in dif- 
ferent directions. 
Nursing Moms Can Exercise, Too


Source: “Aerobics OK for breast-feeding moms” (17 February 1994), Davis (CA) Enterprise, 
p. A7. Reprinted with permission. 


Moderate aerobic exercise has no adverse effects on the quantity or quality of breast 
milk produced by nursing mothers, and can significantly improve the mothers’ car- 
diovascular fitness, according to UC Davis researchers. 

For 12 weeks the study monitored 33 women, beginning six to eight weeks after 
the births of their children. All were exclusively breast-feeding their infants, with no 
formula supplements, and had not previously been exercising. Eighteen women were 
randomly assigned to an exercise group and 15 to a non-exercising group. The exer- 
cise group participated in individual exercise programs, including rapid walking, 
jogging or bicycling, for 45 minutes each day, five days per week. . . . At the end of 
the 12-week study, [the researchers] found: 

Women in both groups experienced weight loss. The rate of weight loss and the 
decline in the percentage of body fat after childbirth did not differ between the ex- 
ercise and control groups, because women in the exercise group compensated for 
their increased energy expenditure by eating more. 

There was an important improvement in the aerobic fitness of the exercising 
women, as measured by the maximal oxygen consumption. 

There was no significant difference between the two groups in terms of infant 
breast-milk intake, energy output in the milk or infant weight gain. 

Prolactin levels in the breast milk did not differ between the two groups, sug- 
gesting that previously observed short-term increases in the level of that hormone 
among non-lactating women following exercise do not influence the basal level of 
prolactin. 


Discussion 


This is an excellent study and excellent reporting. The study was a randomized ex- 
periment and not an observational study, so we can rule out confounding factors re- 
sulting from the fact that mothers who choose to exercise differ from those who do 
not. The mothers were randomly assigned to either the exercise group or a nonexer- 
cising control group. To increase ecological validity and generalizability, they were al- 
lowed to choose their own forms of exercise. All mothers were exclusively breast 
feeding, ruling out possible interactions with other food intake on the part of the
infants. The study could obviously not be performed double blind because the women
knew whether they were exercising. It presumably could have been single blind, but was not.

The article reports that there was an important improvement in fitness; it does not 
give the actual magnitude. The reporter has done the work of determining that the 
improvement is not only statistically significant but also of practical importance. 
From the original research report, we learn that “maximal oxygen uptake increased 
by 25 percent in the exercising women but by only 5 percent in the control women 
(P < .001)” (Dewey et al., 1994, p. 449). 

What about the differences that were reported as nonsignificant or nonexistent, 
such as infant weight gain? There were no obvious cases of misreporting important 
but not significant differences. Most of the variables for which no significant differ- 
ences were found were very close for the two groups. In the original report, confi- 
dence intervals are given for each group, along with a p-value for testing whether 
there is a statistically significant difference. For instance, 95% confidence intervals 
for infant weight gain are 1871 to 2279 grams for the exercise group and 1733 to 
2355 grams for the control group (p = 0.86). The p-value indicates that if aerobic 
exercise has no impact on infant weight gain in the population, we would expect to 
see sample results differ this much or more very often. Therefore, we cannot rule out 
chance as an explanation for the very small differences observed. The same is true 
of the other variables for which no differences were reportedly found. 
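
As a rough check on the reported p-value, the two confidence intervals quoted above can be converted back into approximate means and standard errors. The sketch below assumes each 95% interval is a mean plus or minus 1.96 standard errors and uses a normal approximation; the original report may have used a different procedure (for example, a t-test), so this is an illustration only. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_p_from_intervals(ci_a, ci_b, z=1.96):
    """Approximate two-sided p-value for a difference in means,
    given one 95% confidence interval per group.
    Assumption: each interval is mean +/- z * standard error."""
    mean_a = (ci_a[0] + ci_a[1]) / 2
    mean_b = (ci_b[0] + ci_b[1]) / 2
    se_a = (ci_a[1] - ci_a[0]) / (2 * z)
    se_b = (ci_b[1] - ci_b[0]) / (2 * z)
    diff = mean_a - mean_b
    se_diff = sqrt(se_a ** 2 + se_b ** 2)  # independent groups
    z_stat = diff / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return diff, p

# Infant weight gain intervals (grams): exercise group, control group.
diff, p_value = approx_p_from_intervals((1871, 2279), (1733, 2355))
print(f"difference in mean weight gain: {diff:.0f} grams")
print(f"approximate p-value: {p_value:.2f}")
```

Under these assumptions the difference between the group means is small relative to its standard error, and the approximate p-value comes out close to the reported 0.86.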
I. News Stories and Sources in the
Appendix and on the CD 


News Stories: All of the news stories are in the Appendix, 
and all but numbers 7 and 10 are on the CD as well. 


Additional News Stories: An additional news story ac- 
companies Studies 3, 10, 15, and 19. These are on the CD 
but not in the Appendix. 


Original Sources: Most of the journal articles and reports 
listed as Original Sources are on the CD. Exceptions are 
numbers 14, 18, and 19, for which the original publishers 
did not grant reasonable permission, and numbers 15 and 
16 for which only the abstracts are on the CD. (The abstract 
for a journal article is a short summary prepared by the au- 
thors, which usually appears at the beginning of the article.) 


News Story 1 


“Mending Mindfully” 
Heidi Kotansky, Yoga Journal, November 2003, p. 35. 


Original Source 1: Davidson, Richard J., PhD; Kabat- 
Zinn, Jon, PhD; Schumacher, Jessica, MS; Rosenkranz, 
Melissa, BA; Muller, Daniel, MD, PhD; Santorelli, Saki F., 
EdD; Urbanowski, Ferris, MA; Harrington, Anne, PhD; 
Bonus, Katherine, MA; and Sheridan, John F., PhD. 
(2003). “Alterations in Brain and Immune Function Pro- 
duced by Mindfulness Meditation,” Psychosomatic 
Medicine 65, pp. 564-570. 


News Story 2 


“Research Shows Women Harder Hit by Hangovers” 
Lee Bowman, Sacramento Bee (from Scripps Howard 
News Service) 15 September 2003, p. A7. 


Original Source 2: Slutske, Wendy S.; Piasecki, 
Thomas M.; and Hunt-Carter, Erin E. (2003). “Develop- 
ment and Initial Validation of the Hangover Symptoms 
Scale: Prevalence and Correlates of Hangover Symptoms 
in College Students.” Alcoholism: Clinical and Experi- 
mental Research 27, pp. 1442-1450. 


News Story 3 


“Rigorous Veggie Diet Found to Slash Cholesterol” 
Sacramento Bee (original by Daniel Q. Haney, Associated 
Press) 7 March 2003, p. A9. 


Additional News Story 3 
http://www.newsandevents.utoronto.ca/bin5/030722a.asp 


University of Toronto news release, July 22, 2003, “Diet 
as good as drug for lowering cholesterol, study says” by 
Lanna Crucefix. 


Original Source 3: Jenkins, David J. A.; Kendall, Cyril 
W. C.; Marchie, Augustine; Faulkner, Dorothea A.; Wong, 
Julia M. W.; de Souza, Russell; Emam, Azadeh; Parker, 
Tina L.; Vidgen, Edward; Lapsley, Karen G.; Trautwein, 
Elke A.; Josse, Robert G.; Leiter, Lawrence A.; and Con- 
nelly, Philip W. (23 July 2003). “Effects of a Dietary Port- 
folio of Cholesterol-Lowering Foods vs Lovastatin on 
Serum Lipids and C-Reactive Protein.” Journal of the 
American Medical Association 290, no. 4, pp. 502-510. 


News Story 4 


“Happy People Can Actually Live Longer” 
Henry Reed, Venture Inward Magazine, October/ 
November 2003, p. 9. 
Original Source 4:
http://dukemednews.duke.edu/news/article.php?id=6515


“Duke Health Briefs: Positive Outlook Linked to Longer 
Life in Heart Patients.” 21 April 2003. 


News Story 5 


“Driving While Distracted Is Common, Researchers 
Say; Reaching for Something in the Car and Fiddling 
with the Stereo Far Outweigh Inattentiveness 
Caused by Cellphone Use, Traffic Study Finds” 
Susannah Rosenblatt, Los Angeles Times (Los Angeles, 
Calif.) 7 August 2003, p. A.20. 


Original Source 5: Stutts et al., “Distractions in Everyday 
Driving.” Technical report prepared for the AAA Founda- 
tion for Traffic Safety, June 2003 (129 pages). Available 
online at: 


http://www.aaafoundation.org/pdf/ 
DistractionsInEverydayDriving.pdf 


News Story 6 


“Music as Brain Builder” 
Constance Holden (26 March 1999), Science 283, p. 2007. 


Original Source 6: Graziano, Amy B.; Peterson, 
Matthew; and Shaw, Gordon L. (March 1999). “Enhanced 
Learning of Proportional Math through Music Training 
and Spatial-temporal Training.” Neurological Research 21, 
pp. 139-152. 


News Story 7 (Not on CD) 


“State Reports Find Fraud Rate of 42%
in Auto Body Repairs”
Edgar Sanchez, Sacramento Bee, 16 September 2003,
p. B2.


Original Source 7: California Department of Consumer
Affairs. (September 2003). “Auto Body Repair
Inspection Pilot Program: Report to the Legislature.”
Department of Consumer Affairs, technical report,


http://www.dca.ca.gov/r_r/autobody_report_090903.pdf 


News Story 8 


“Education, Kids Strengthen Marriage” 
Marilyn Elias, USA Today, 7 August 2003, p. 8D. 


Original Source 8: None—story based on poster presen- 
tation at the American Psychological Association’s annual 
meeting, Toronto, Canada, August 2003. 


News Story 9 


“Against Depression, a Sugar Pill Is Hard to Beat” 
Shankar Vedantam, Washingtonpost.com, 7 May 2002, 
p. A01.


Original Source 9: Khan, Arif, MD; Khan, Shirin; Kolts, 
Russell, PhD; and Brown, Walter A., MD. (April 2003). 
“Suicide Rates in Clinical Trials of SSRIs, Other Antide- 
pressants, and Placebo: Analysis of FDA Reports.” Ameri- 
can Journal of Psychiatry 160, pp. 790-792. 


News Story 10 (Not on CD) 
“Churchgoers Live Longer, Study Finds” 


Nancy Weaver Teichert, Sacramento Bee, 24 December 
2001, p. A3. 


Additional News Story 10 
“Keeping the Faith: UC Berkeley Researcher Links 
Weekly Church Attendance to Longer, Healthier Life” 


http://www.berkeley.edu/news/media/releases/
2002/03/26_faith.html 


UC Berkeley press release. 


Original Source 10: Oman, Doug; Kurata, John H.; 
Strawbridge, William J.; and Cohen, Richard D. (2002). 
“Religious attendance and cause of death over 31 years.” 
International Journal of Psychiatry in Medicine 32,
pp. 69-89.


News Story 11 
“Double Trouble Behind the Wheel” 


John O’Neil, Sacramento Bee (from New York Times) 
14 September 2003, p. L8. 


Original Source 11: Horne, J. A.; Reyner, L. A.; and Bar- 
rett, P. R. (2003). “Driving impairment due to sleepiness is 
exacerbated by low alcohol intake.” Journal of Occupa- 
tional and Environmental Medicine 60, pp. 689-692. 


News Story 12 


“Working Nights May Increase Breast Cancer Risk” 
Sacramento Bee (Associated Press) 17 October 2001, 
p. A7. 


Original Source 12a: Hansen, Johnni. (17 October 2001). 
“Light at Night, Shiftwork, and Breast Cancer Risk.” 
Journal of the National Cancer Institute 93, no. 20, pp. 
1513-1515. 


Original Source 12b: Davis, Scott; Mirick, Dana K.; and 
Stevens, Richard G. (17 October 2001). “Night Shift 
Work, Light at Night, and Risk of Breast Cancer.” Journal 
of the National Cancer Institute 93, no. 20, pp. 
1557-1562. 


Original Source 12c: Schernhammer, Eva S.; Laden, 
Francine; Speizer, Frank E.; Willett, Walter C.; Hunter, 
David J.; and Kawachi, Ichiro. (17 October 2001). “Rotat- 
ing Night Shifts and Risk of Breast Cancer in Women Par- 
ticipating in the Nurses’ Health Study.” Journal of the 
National Cancer Institute 93, no. 20, pp. 1563-1568. 


News Story 13 
“3 Factors Key for Drug Use in Kids” 


Sacramento Bee (Jennifer C. Kerr, Associated Press) 20 
August 2003, p. A7. 


Original Source 13: “2003 CASA National Survey
of American Attitudes on Substance Abuse VIII: Teens
and Parents.” A report from the National Center
on Addiction and Substance Abuse at Columbia
University, August 2003. Available at:


www.casacolumbia.org/usr_doc/2003_Teen_Survey.pdf 


News Story 14 


“Study: Emotion Hones Women's Memory” 
Sacramento Bee (from Associated Press by Paul Recer) 
23 July 2002, pp. A1, A11.


Original Source 14 (Not on CD): Canli, Turhan; 
Desmond, John E.; Zhao, Zuo; and Gabrieli, John D. E. 
(2002). “Sex differences in the neural basis of emotional 
memories.” Proceedings of the National Academy of Sci- 
ences 99, pp. 10789-10794. 


News Story 15 
“Kids' Stress, Snacking Linked” 


Jane E. Allen, Sacramento Bee (orig. Los Angeles Times) 
7 September 2003, p. L8. 


Additional News Story 15 


“Stress Linked to Obesity in School-age Children” 
André Picard, The Globe and Mail [UK] 2 August 2003 
http://www.globeandmail.com/servlet/story/
RTGAM.20030801.ufatt0802/BNStory/National/ 


Original Source 15: (Abstract only on the CD) 
Cartwright, Martin; Wardle, Jane; Steggles, Naomi; Si- 
mon, Alice E.; Croker, Helen; and Jarvis, Martin J. (Au- 
gust 2003). “Stress and Dietary Practices in Adolescents.” 
Health Psychology 22, no. 4, pp. 362-369. 


News Story 16 


“More on TV Violence” 
Constance Holden (21 March 2003), Science 299, 
p. 1839. 


Original Source 16: (Abstract only on the CD) 
Huesmann, L. Rowell; Moise-Titus, Jessica; Podolski, 
Cheryl-Lynn P.; and Eron, Leonard D. (2003). “Longitudi- 
nal Relations Between Children’s Exposure to TV Vio- 
lence and Their Aggressive and Violent Behavior in Young 
Adulthood: 1977-1992.” Developmental Psychology 39, 
no. 2, pp. 201-221. 


News Story 17 


“Even when Monkeying Around, Life Isn't Fair” 
Jamie Talan, San Antonio Express-News (original from 
Newsday.com, 18 September 2003) 21 September 2003, 
p. 18A. 


Original Source 17: Brosnan, Sarah F.; and de Waal, 
Frans B. M. (18 September 2003). “Monkeys reject un- 
equal pay.” Nature 425, pp. 297-299. 


News Story 18 


“Heavier Babies Become Smarter
Adults, Study Shows”
Sacramento Bee (by Emma Ross, Associated Press)
26 January 2001, p. A11.


Original Source 18 (Not on CD): Richards, Marcus; 
Hardy, Rebecca; Kuh, Diana; and Wadsworth, Michael
E. J. (January 2001). “Birth weight and cognitive function
in the British 1946 birth cohort: longitudinal population 
based study.” British Medical Journal 322, pp. 199-203. 


News Story 19 


“Young Romance May Lead
to Depression, Study Says”
Sacramento Bee (by Malcolm Ritter, Associated Press)
14 February 2001, p. A6.
Additional News Story 19


“Puppy love’s dark side: First study of love-sick teens 
reveals higher risk of depression, alcohol use and 
delinquency” 


http://www.news.cornell.edu/releases/May01/teenlove.ssl.html 
Cornell University press release. 


Original Source 19 (Not on CD): Joyner, Kara; and Udry, 
J. Richard. (2000). “You Don’t Bring Me Anything But 
Down: Adolescent Romance and Depression.” Journal of 
Health and Social Behavior 41, no. 4, pp. 369-391. 


News Story 20 


“Eating Organic Foods Reduces Pesticide 
Concentrations in Children” 


http://www.newfarm.org/news/030503/pesticide_kids.shtml


The New Farm, News and Research, online; Copyright 
The Rodale Institute. 


Original Source 20: Curl, Cynthia L.; Fenske, Richard A.;
and Elgethun, Kai. (March 2003). “Organophosphorus 
Pesticide Exposure of Urban and Suburban Preschool 
Children with Organic and Conventional Diets.” Environ- 
mental Health Perspectives 111, no. 3, pp. 377-382. 


II. Applets


1. Sampling Applet: This applet helps you to explore 
simple random sampling, described in Section 4.4. It 
shows a simple random sample of size 10 being selected 
from a population of 100 individuals. You can see how the 
proportion of females and the mean height change with 
each sample you take. 


2. Empirical Rule Applet: This applet allows you to ex- 
plore how well the Empirical Rule from Chapter 8 works 
on data with various shapes. The data were collected from
students at the University of California at Davis and Penn
State University.


3. Correlation Applet: This applet helps you visualize 
correlation, described in Chapter 10, and the way it can be 
influenced by a few points, as described in Chapter 11. It 
presents you with the challenge of placing points on a 
scatterplot to try to obtain specified correlation values. 
You can observe how the correlation changes as a result of 
outliers and other interesting points. 


4. Sample Means Applet: This applet helps you under- 
stand the Rule for Sample Means from Section 19.3. It 
generates random samples from a population in which the 
mean is 8 and the standard deviation is 5, like the weight 
loss example in Section 19.3. A histogram of the means 
from the samples is constructed, and you can see how its 
shape approaches a bell shape when hundreds of sample 
means are included. 


5. TV Means Applet: Like the Sample Means Applet, 
this applet helps you understand the Rule for Sample 
Means from Section 19.3. For this applet, unlike the pre- 
vious one, the shape of the population is highly skewed; 
the data represent hours of TV college students reported 
watching in a typical week. The histogram of sample 
means still becomes bell-shaped when enough (large) 
samples are chosen, but it takes longer than it did in the
Sample Means Applet because of the initial skewed
distribution.


6. Confidence Level Applet: This applet illustrates the 
idea of a confidence interval from Chapter 20 and confi- 
dence level from Chapter 21. You choose a confidence 
level. The applet will generate repeated samples from a 
population representing weights of college-age men and 
calculate a confidence interval for the mean based on each 
sample, using the confidence level you specified. The goal 
is to see if the proportion of intervals that capture the pop- 
ulation mean is close to the specified confidence level. 
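
The idea behind the Confidence Level Applet can also be sketched in a few lines of code. The example below is not the applet itself: the population mean and standard deviation (170 and 25) are hypothetical stand-ins for the weights of college-age men, the population is assumed to be bell-shaped, and the function name is ours. It draws repeated samples, builds an interval from each one, and reports the proportion of intervals that capture the population mean.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def coverage(pop_mean, pop_sd, n, level, reps, seed=1):
    """Fraction of simulated confidence intervals that capture pop_mean."""
    rng = random.Random(seed)
    # Multiplier for the chosen confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        m = mean(sample)
        margin = z * stdev(sample) / sqrt(n)
        if m - margin <= pop_mean <= m + margin:
            hits += 1
    return hits / reps

# Hypothetical population: mean 170, standard deviation 25 (assumed values).
prop = coverage(pop_mean=170, pop_sd=25, n=100, level=0.95, reps=1000)
print(f"proportion of intervals capturing the mean: {prop:.3f}")
```

With a fairly large sample size, the proportion should land near the chosen confidence level; changing `level` or `n` mimics the choices offered by the applet.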
News Story 2
Research shows women harder hit by hangovers 
By Lee Bowman, Scripps Howard News Service 


In the life-is-not-fair category, new research finds that 
women not only get drunk on fewer drinks than men but 
women also suffer from worse hangovers. 

A team at the University of Missouri-Columbia devel- 
oped a new scientific scale for measuring hangover symp- 
toms and severity. Even accounting for differences in the 
amount of alcohol consumed by men and women, hangovers
hit women harder. “This finding makes biological
sense, because women tend to weigh less and have lower 
percentages of total body water than men do, so they should 
achieve higher degrees of intoxication and, presumably, 
more hangover per unit of alcohol,” said Wendy Slutske, an 
associate professor of psychology who led the team. 

The study, supported by the National Institutes of 
Health, is being published Monday in the journal Alco- 
holism: Clinical and Experimental Research. 

The researchers asked 1,230 drinking college students, 
only 5 percent of them of legal drinking age, to describe 
how often they experienced any of 13 symptoms after drink- 
ing. The symptoms in the study ranged from headaches and 
vomiting to feeling weak and unable to concentrate. 

Besides women, the study found that the symptoms 
were more common in students who reported having 
alcohol-related problems or who had one or both biological 
parents with a history of alcohol-related problems. 

“We were surprised to discover how little research had 
been conducted on hangover, because the research that does 
exist suggests that hangover could be an important factor in 
problem drinking,” said Thomas Piasecki, an assistant pro- 
fessor of clinical psychology who took part in the study. 

Other research has pinpointed hangover impairment as 
an important factor in drinkers suffering injury or death, 
and in economic losses arising from people taking time off 
to recover from a drinking bout. 

The most common symptom reported was dehydration or 
feeling thirsty. The least common symptom was trembling or 
shaking. Based on having at least one of the symptoms, most 
students had been hung over between three and 11 times in 
the previous year. On average, students had experienced five 
of the 13 symptoms at least once during that period. 

“While hangover is a serious phenomenon among col- 
lege drinkers, for most of them it occurs rarely enough that 
it is unlikely to have a major deleterious impact on aca- 
demic performance,” Slutske said. 

However, 26 percent of the students reported having ex- 
perienced hangovers at least once a month in the past year, 
and the researchers speculate they could be at a higher risk 
of failure, and wondered if they might represent some iden- 
tifiable segment of the student population, such as frater- 
nity or sorority members. 


Source: From Bowman, Lee, “Research Shows Women Harder 
Hit by Hangovers,” September 15, 2003. Copyright © 2003 
Scripps Howard News Service. Reprinted with permission. 
News Story 5


Driving while distracted is common, researchers 
say; reaching for something in the car and fiddling 
with the stereo far outweigh inattentiveness caused 
by cellphone use, traffic study finds 


By Susannah Rosenblatt 


Nearly all drivers become distracted when behind the 
wheel, and cellphones aren’t the main culprit, a study re- 
leased Wednesday found. 

Reaching for something inside the car and fiddling with 
the audio system are the primary causes of driver inatten- 
tion, according to a report by the American Automobile 
Assn.’s Foundation for Traffic Safety, a nonprofit driver ed- 
ucation group. 

“What we want to do is make people more aware of the 
fact that there are distractions beyond cellphones,” said 
Stephanie Faul, the group’s communications director. 
“When you’re fumbling around in the foot well because 
you’ve dropped part of your sandwich, that’s as much a dis- 
traction as anything else.” 

The study was conducted from 2001 to 2002 by re- 
searchers at the University of North Carolina’s Highway 
Safety Research Center in Chapel Hill and by Philadelphia 
traffic research consultants TransAnalytics. Researchers 
used small windshield-mounted cameras to videotape 70 
volunteers from Chapel Hill and Philadelphia driving their 
own cars. 

The project then analyzed three random hours of tape 
from the average of seven to eight hours collected for each 
subject and tabulated their concentration when behind the 
wheel. 

More than 97% of the drivers leaned or reached for 
something inside the car; more than 91% adjusted the vehi- 
cle’s audio controls. More than three-quarters carried on 
conversations and more than 71% ate or drank behind the 
wheel, the report found. Just 30% of the 28 drivers with 
cellphones used them in the car, according to the report. 

“Cellphones have been getting a disproportionate 
amount of negative attention relative to the type of distrac- 
tion they cause,” said Barbara Harsha, executive director of 
the Washington-based Governors Highway Safety Assn., 
which seeks to educate states about traffic safety. “People 
just love to hate cellphones.” 
Excluding talking to passengers, drivers engaged in a
potentially distracting activity 16.1% of the time that their 
cars were moving. The scope of the problem is consider- 
able: Inattentive drivers account for up to 30% of police- 
reported traffic accidents, about 1.2 million a year, accord- 
ing to the National Highway Traffic Safety Administration. 

Only 17 states, including California, collect information 
on driver cellphone usage and inattention that contribute to 
traffic accidents, Harsha said. 

The Governors Highway Safety Assn. has collaborated 
with federal transportation organizations to draft guidelines 
for states on how to gather such data. The California High- 
way Patrol found that a little more than 1% of the 491,083 
drivers involved in traffic collisions in the state from Janu- 
ary to June 2002 were inattentive in their driving. Of those 
5,700 drivers, 11% were talking on cellphones and 9% 
were tuning the radio. Six of the 611 cellphone-related 
crashes were fatal. The majority—67%—of the accidents
were caused by other distractions, from daydreaming to 
reading street signs, the study found. 

Inattention is “not a huge problem in California,” said 
Anne DaVigo, a spokeswoman for the California Highway 
Patrol. But cellphone use is especially noticeable on con- 
gested, slow-moving California freeways, DaVigo said. 
“While the numbers of persons killed or injured [due to 
cellphone use] aren’t high, obviously we’re concerned 
when people are dying.” 

The traffic foundation study was the second phase of a 
project that began in 1999 with the analysis of NHTSA 
crash data to determine the extent that driver distraction 
caused accidents. The recent phase of the project was de- 
signed to examine “what people actually do in their cars,” 
said Jane Stutts, the report’s lead researcher. 

Stutts’ team had to halve the sample size from 144 peo- 
ple to 70 when it ran into budget and time constraints while 
cataloging hundreds of hours of videotape. The reduced 
sample size did not compromise the findings, Stutts said, 
although it did make analyzing population subsets difficult. 
“We’re very comfortable with what we have,” she said, 
adding that the NHTSA has several larger-scale driver dis- 
traction studies underway. 

The foundation hopes to help combat the problem of 
driver inattention with educational warning sections in state 
driver’s license manuals. Only six states now offer infor- 
mation to new drivers on the dangers of distracted driving. 
The organization also has unveiled a public service an- 
nouncement campaign to inform drivers of the risks of 
inattention. 

“If the government, industry and the traffic safety community
can join together to educate the public, we can
make significant progress fighting the very real problem of
distracted driving,” Robert L. Darbelnet, the foundation’s
president and chief executive, said in a statement.


Infobox: 


Driven to distraction 

Among the many things that weaken driver attention, 
the study found, cellphones are low on the list. With the 
exception of manipulating air conditioning and other 
controls, which all subjects were observed doing, here is 
the percentage of drivers who engaged in the most com- 
mon distracting activities: 


Reaching, leaning, etc.—97.1%
Changing audio controls—91.4%
External distraction—85.7%
Conversing—77.1%
Eating/drinking/spilling—71.4%
Misc. internal distractions—67.1%
Preparing to eat/drink—58.6%
Grooming—45.7%
Reading or writing—40.0%
Talking on cellphone—30.0%
Dialing cellphone—27.1%
Adult distraction—22.9%
Answering cellphone—15.7%
Child distraction—12.9%
Baby distraction—8.6%
Smoking—7.1%


Source: AAA Foundation for Traffic Safety 


Source: From “Driving While Distracted Is Common, Re- 
searchers Say,” by Susannah Rosenblatt. Los Angeles Times, Los 
Angeles, Calif.: Aug 7, 2003. p. A.20. With permission of Tri- 
bune Media Services. 


News Story 6 
Music as brain builder 
By Constance Holden 


Many parents and day-care facilities nowadays expose tots 
to classical music in hopes of triggering the so-called 
“Mozart effect”—the sharpening of the brain that some 
classical music is said to bring about. Now the researchers 
who started the trend, Gordon Shaw and colleagues at the 
University of California, Irvine, have come up with evi- 
dence that piano lessons hike up children’s performance on 
a test of proportional math. 

Six years ago, Shaw’s group found that listening to a 
Mozart two-piano sonata briefly raised college students’
spatial skills. They subsequently reported that in preschoolers,
piano lessons gave a sustained boost to spatial skills.

In the latest study, Shaw compared three groups of sec- 
ond-graders: 26 got piano instruction plus a math video 
game that trains players to mentally rotate shapes and to use 
them to learn ratios and fractions. Another 29 got com- 
puter-based English training plus the video game. A control 
group of 28 got no special training. After 4 months, the re- 
sults were “dramatic,” the authors report in the current is- 
sue of Neurological Research. 

The piano group scored 15% higher than the English 
group in a test of what they had learned in the computer 
game—and 27% higher on the questions devoted to pro- 
portional math. These gains were on top of the finding that 
the computer game alone boosted scores by 36% over the 
control group. 

Shaw says the improvements suggest that spatial awareness
and the need to think several steps ahead—both required
in piano playing—reinforce latent neuronal patterns.
“Music is just tapping into this internal neural structure that 
we’re born with,” he says. Piano lessons may well condi- 
tion the brain just as muscle-building conditions an athlete, 
says Michael Merzenich, a neuroanatomist at the Univer- 
sity of California, San Francisco. Music may be a “skill 
... more fundamental than language” for refining the abil- 
ity of the brain to make spatial and temporal distinctions, 
he says. 


Source: Reprinted with permission of Constance Holden, “Music 
as Brain Builder,” Science, Vol. 283, March 26, 1999, p. 2007. 
Copyright © 1999 American Association for the Advancement of 
Science. 


Text not available due to copyright restrictions 


News Story 11 
Double trouble behind the wheel 
By John O’Neil 


Fatigue and even a legally permissible amount of alcohol 
can combine dangerously for drivers, a British study has 
found, especially because the young men being tested ap- 
peared to be unaware of their increased impairment. 
After getting five hours of sleep, 12 men who had the
equivalent of two glasses of beer with lunch drifted out of 
their lanes an average of 32 times in a simulated two-hour 
highway drive. 

When they were sleepy but had not had alcohol, they 
drifted out 18 times, and when they had the drinks but were 
rested, they strayed 15 times. 

When they were rested and had not drunk alcohol, they 
left their lanes only seven times, according to the study, 
which was published in Occupational and Environmental 
Medicine, a journal of the British Medical Association. 

The study’s lead researcher, Dr. Jim Horne of Lough- 
borough University in Leicestershire, said the drivers’ 
blood-alcohol levels had fallen to less than half of the legal 
limit in Britain, 0.08, by the start of the test and to almost 
nothing by the end of the simulation. 

“Sleepy drivers having consumed some alcohol may not 
realize how bad their driving is,” Dr. Horne said.

The study used men with an average age of 22 as the 
subjects because 90 percent of British crashes listed as 
sleep-related are caused by men, and half of those are 
caused by men under 30. 

“Advice: don’t drink alcohol and drive, especially if you 
are sleepy,” Dr. Horne said. 


Source: From John O’Neil, New York Times, Sept. 2, 2003, 
p. F.6, “Double trouble behind the wheel.” Copyright © 2003 
New York Times Co. Reprinted with permission. 
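
The four averages reported in News Story 11 form a two-by-two layout (rested or sleepy, with or without alcohol), so one can check whether the combined effect is larger than the two separate effects added together. The short Python sketch below does that arithmetic. The names `lane_drifts` and `additive_prediction` are illustrative only, and the additive comparison is an assumption made here for illustration, not an analysis described in the story.

```python
# Average lane departures reported in News Story 11
# (12 men, simulated two-hour highway drive).
lane_drifts = {
    ("rested", "no alcohol"): 7,
    ("rested", "alcohol"): 15,
    ("sleepy", "no alcohol"): 18,
    ("sleepy", "alcohol"): 32,
}


def additive_prediction(cells):
    """Predict the sleepy-plus-alcohol average if the two effects simply added."""
    baseline = cells[("rested", "no alcohol")]
    alcohol_effect = cells[("rested", "alcohol")] - baseline
    sleep_effect = cells[("sleepy", "no alcohol")] - baseline
    return baseline + alcohol_effect + sleep_effect


predicted = additive_prediction(lane_drifts)
observed = lane_drifts[("sleepy", "alcohol")]
interaction = observed - predicted  # excess beyond the additive prediction

print(f"Additive prediction: {predicted} lane departures")
print(f"Observed:            {observed} lane departures")
print(f"Excess (interaction): {interaction}")
```

Under these assumptions the additive prediction is 26 lane departures, 6 fewer than the 32 observed, which is in line with the story's statement that fatigue and alcohol can combine dangerously.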


News Story 12 
Working nights may increase breast cancer risk 
By Paul Recer 


Women who work nights may increase their breast cancer 
risk by up to 60 percent, according to two studies that sug- 
gest bright light in the dark hours decreases melatonin se- 
cretion and increases estrogen levels. 

Two independent studies, using different methods, 
found increased risk of breast cancer among women who 
worked night shifts for many years. The studies, both ap- 
pearing in the Journal of the National Cancer Institute, 
suggested a “dose effect,” meaning that the more time spent 
working nights, the greater the risk of breast cancer. 

“We are just beginning to see evidence emerge on the 
health effects of shift work,” said Scott Davis, an epidemi- 
ologist at the Fred Hutchinson Cancer Research Center in 
Seattle and first author of one of the studies. He said more 
research was needed, however, before a compelling case 
could be made to change night work schedules.

“The numbers in our study are small, but they are statistically
significant,” said Francine Laden, a researcher at
Brigham and Women’s Hospital in Boston and co-author of 
the second study. 

“These studies are fascinating and provocative,” said 
Larry Norton of the Memorial Sloan-Kettering Cancer 
Center in New York. “Both studies have to be respected.” 

But Norton said the findings only hint at an effect and 
raise “questions that must be addressed with more 
research.” 

In Davis’ study, researchers explored the work history of 
763 women with breast cancer and 741 women without the 
disease. 

They found that women who regularly worked night 
shifts for three years or less were about 40 percent more 
likely to have breast cancer than women who did not work 
such shifts. Women who worked at night for more than 
three years were 60 percent more likely. 

The Brigham and Women’s study, by Laden and her col- 
leagues, found only a “moderately increased risk of breast 
cancer after extended periods of working rotating night 
shifts.” 

The study was based on the “medical and work histories 
of” more than 78,000 nurses from 1988 through May 1998. 
It found that nurses who worked rotating night shifts at 
least three times a month for one to 29 years were about 8 
percent more likely to develop breast cancer. For those who 
worked the shifts for more than 30 years, the relative risk of 
breast cancer went up by 36 percent. 

American women have a 12.5 percent lifetime risk of 
developing breast cancer, according to the American Can- 
cer Society. Laden said her study means that the lifetime 
risk of breast cancer for longtime shift workers could rise 
above 16 percent. There are about 175,000 new cases of 
breast cancer diagnosed annually in the United States and 
about 43,700 deaths. Breast cancer is second only to lung 
cancer in causing cancer deaths among women. 

Both of the Journal studies suggested that the increased 
breast cancer risk among shift workers is caused by 
changes in the body’s natural melatonin cycle because of 
exposure to bright lights during the dark hours. 

Melatonin is produced by the pineal gland during the 
night. Studies have shown that bright light reduces the se- 
cretion of melatonin. In women, this may lead to an in- 
crease in estrogen production; increased estrogen levels 
have been linked to breast cancer. 

“If you exposed someone to bright light at night, the
normal rise in melatonin will diminish or disappear altogether,”
said Davis. “There is evidence that this can increase
the production of reproductive hormones, including
estrogen.” Davis said changes in melatonin levels in men
doing nighttime shift work may increase the risk of some
types of male cancer, such as prostate cancer, but he knows
of no study that has addressed this specifically.


Source: From Paul Recer, “Night Shift Linked to Breast Cancer 
Risk,” Associated Press, October 17, 2001. Reprinted with per- 
mission of the Associated Press. 
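
The percentages in News Story 12 are relative increases, so turning them into a lifetime risk means multiplying the baseline risk rather than adding percentage points. The Python sketch below shows that arithmetic for the nurses' study. The function name `risk_after_increase` is hypothetical, and applying the relative increase directly to the 12.5 percent lifetime risk is a simplifying assumption made here, not a calculation given in the story.

```python
def risk_after_increase(baseline_risk_pct, relative_increase_pct):
    """Scale a baseline risk (in percent) by a relative increase (in percent)."""
    return baseline_risk_pct * (1 + relative_increase_pct / 100)


baseline = 12.5  # lifetime breast cancer risk (percent) quoted in the story

# Relative increases reported for nurses working rotating night shifts.
reported_increases = {
    "1 to 29 years": 8,
    "more than 30 years": 36,
}

for duration, increase in reported_increases.items():
    risk = risk_after_increase(baseline, increase)
    print(f"{duration}: {baseline}% x {1 + increase / 100:.2f} = {risk:.1f}%")
```

With the 36 percent increase the result is about 17 percent, which agrees with Laden's remark that the lifetime risk for longtime shift workers "could rise above 16 percent."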


News Story 13 


3 Factors key for drug use in kids: 
A study says stress, boredom and extra 
cash boost the risk for substance abuse 


By Jennifer C. Kerr 


A survey of American children and parents released Tuesday 
found a mix of three ingredients in abundance for many kids 
can lead to substance abuse: boredom, stress and extra money. 

The annual study by Columbia University’s National 
Center on Addiction and Substance Abuse also found stu- 
dents attending smaller schools or religious schools are less 
likely to abuse drugs and alcohol. 

Joseph Califano Jr., the center’s chairman and president, 
said 13.8 million teens are at moderate or high risk of sub- 
stance abuse. 

“Parental engagement in their child’s life is the best pro- 
tection Mom and Dad can provide,” he said. 

The study found that children ages 12 to 17 who are fre- 
quently bored are 50 percent more likely to smoke, drink, 
get drunk or use illegal drugs. And kids with $25 or more a 
week in spending money are nearly twice as likely to 
smoke, drink or use drugs as children with less money. 

Anxiety is another risk factor. The study found that 
youngsters who said they’re highly stressed are twice as 
likely as low-stress kids to smoke, drink or use drugs. 

High stress was experienced among girls more than 
boys, with nearly one in three girls saying they were highly 
stressed compared with fewer than one in four boys. One 
possible factor is social pressure for girls to have sex, re- 
searchers said. 

Charles Curie, administrator of the Substance Abuse 
and Mental Health Services Administration, said his agency 
has found similar risk factors among U.S. youth. 

He said the best thing parents can do to steer their kids 
away from drugs and alcohol is to talk to them and stay in- 
volved in their lives. It’s also important, he said, to know 
their children’s friends. 

For the first time in the survey’s eight-year history, 
young people said they are as concerned about social and 
academic pressures as they are about drugs. In the past, 
Califano said, drugs were by far the No. 1 pressure on kids.


There was some encouraging news. The study found 
that 56 percent of those surveyed have no friends who reg- 
ularly drink, up from 52 percent in 2002. Nearly 70 percent 
have no friends who use marijuana. 

Among the study’s other findings: 


• The average age of first use is about 12 for alcohol,
12 1/2 for cigarettes and nearly 14 for pot.

• More than five million children ages 12 to 17, or 20
percent, said they could buy marijuana in an hour or
less. Another five million said they could buy it
within a day.

• Kids at schools with more than 1,200 students are
twice as likely as those attending schools with fewer
than 800 students to be at high risk for substance
abuse.


QEV Analytics surveyed 1,987 children ages 12 to 17 and 
504 parents, 403 of whom were parents of interviewed kids. 
They were interviewed from March 30 to June 14. The mar- 
gin of error was plus or minus two percentage points for chil- 
dren and plus or minus four percentage points for parents. 

[The news article was accompanied by an illustration, 
titled “Teen drug risks: American teens reporting high 
stress, boredom or disposable income were more likely to 
use drugs, a recent survey found.” There were three bar 
graphs illustrating the percent of students who had tried al- 
cohol and marijuana. The first graph showed four levels of 
weekly disposable income, the next showed stress levels 
categorized as low, moderate, and high, and the final graph 
compared students who were not bored and often bored. 
The illustration was accompanied by this explanation. 
“About this poll: National telephone survey of 1,987 12- to 
17-year-olds was conducted March 30 to June 14. The mar- 
gin of error was plus or minus 2 percentage points. Source: 
National Center on Addiction and Substance Abuse at Columbia
University.”]


Source: From “3 Factors key for drug use in kids,” by Jennifer 
C. Kerr, Associated Press, August 20, 2003. Reprinted with per- 
mission of the Associated Press. 
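
The margins of error quoted in News Story 13 can be checked against the sample sizes. The Python sketch below uses the conservative rule of thumb that the 95 percent margin of error for a sample proportion is roughly 1 divided by the square root of the sample size. That rule, the assumption of a simple random sample, and the function name `conservative_margin_of_error` are assumptions made here; the story does not say how the pollster computed its figures.

```python
import math


def conservative_margin_of_error(n):
    """Approximate 95% margin of error for a proportion: 1 / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 1.0 / math.sqrt(n)


# Sample sizes reported in the story.
samples = {"children": 1987, "parents": 504}

for group, n in samples.items():
    moe_points = 100 * conservative_margin_of_error(n)
    print(f"{group}: n = {n}, margin of error is about {moe_points:.1f} percentage points")
```

The results, about 2.2 points for the children and about 4.5 points for the parents, are close to the reported plus or minus two and four percentage points; the small differences may reflect rounding or a slightly different formula.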


News Story 14 
Study: Emotion hones women's memory 
By Paul Recer 


Matrimonial lore says husbands never remember marital 
spats and wives never forget. A new study suggests a rea- 
son: Women’s brains are wired both to feel and to recall 
emotions more keenly than the brains of men.

A team of psychologists tested groups of women and
men for their ability to recall or recognize highly evocative 
photographs three weeks after first seeing them and found 
the women’s recollections were 10 percentage points to 15 
percentage points more accurate. 

The study, appearing in the Proceedings of the National 
Academy of Sciences, also used MRIs to image the sub- 
jects’ brains as they were exposed to the pictures. It found 
the women’s neural responses to emotional scenes were 
much more active than the men’s. 

Turhan Canli, an assistant professor of psychology at 
SUNY, Stony Brook, said the study shows a woman’s brain 
is better organized to perceive and remember emotions. 

The findings by Canli and researchers from Stanford 
University are consistent with earlier research that found 
differences in the workings of the minds of women and 
men, said Diane Halpern, director of the Berger Institute for 
Work, Family, and Children and a professor of psychology 
at Claremont McKenna College in California. 

Halpern said the study “makes a strong link between 
cognitive behavior and a brain structure that gets activated” 
when exposed to emotional stimuli. 

“It advances our understanding of the link between cognition
and the underlying brain structures,” she said. “But it
doesn’t mean that those are immutable, that they can’t 
change with experience.” 

Canli said the study may help move science closer to 
finding a biological basis to explain why clinical depression 
is much more common in women than in men. 

Canli said a risk factor for depression is rumination, or 
dwelling on a memory and reviewing it time after time. The 
study illuminates a possible biological basis for rumination, 
he said. 

Halpern said the study also supports findings that 
women, in general, have a better autobiographical memory 
for anything, not just emotional events. 

She said the study supports the folkloric idea that a wife 
has a truer memory for marital spats than does her husband. 

“One reason for that is that it has more meaning for 
women and they process it a little more,” Halpern said.
“But you can’t say that we’ve found the brain basis for this, 
because our brains are constantly changing.” 

In the study, Canli and his colleagues individually tested 
the emotional memory of 12 women and 12 men using a set 
of pictures. Some of the pictures were ordinary, and others 
were designed to evoke strong emotions. 

Each of the subjects viewed the pictures and graded 
them on a three-point scale ranging from “not emotionally 
intense” to “extremely emotionally intense.” 

As the subjects viewed the pictures, images were taken 
of their brains using magnetic resonance imaging. This
measures neural blood flow and identifies portions of the
brain that are active. 

Canli said women and men had different emotional re- 
sponses to the same photos. For instance, the men would 
see a gun and call it neutral, but for women it would be 
“highly, highly negative” and evoke strong emotions. 

Neutral pictures showed such things as a fireplug, a 
book case or an ordinary landscape. 

The pictures most often rated emotionally intense 
showed dead bodies, gravestones and crying people. A pic- 
ture of a dirty toilet prompted a strong emotional response, 
especially from the women, Canli said. 

All the test subjects returned to the lab three weeks later 
and were surprised to learn they would be asked to remem- 
ber the pictures they had seen. Canli said they were not told 
earlier they would be asked to recall pictures from the ear- 
lier session. 

In a memory test tailored for each person, they were 
asked to pick out pictures they earlier rated as “extremely 
emotionally intense.” The pictures were mixed among 48 
new pictures. Each image was displayed for less than three 
seconds. 

“For pictures that were highly emotional, men recalled 
around 60 percent and women were at about 75 percent,” 
Canli said. 


Source: From “Study: Emotion hones women’s memory,” by 
Paul Recer, Associated Press, July 23, 2002. Reprinted with permission
of the Associated Press.


News Story 16
More on TV violence 
By Constance Holden 


An unusual longitudinal study has strengthened the case 
that children who watch violent TV become more aggres- 
sive adults, but agreement is still elusive on this long- 
smoldering issue. 

L. Rowell Huesmann and colleagues at the University of 
Michigan, Ann Arbor, studied 557 young Chicago children 
in the 1970s and found that over a 3-year period their TV 
habits predicted childhood aggression. Now they’ve done a 
15-year follow-up on 329 of their subjects. In this month’s 
issue of Developmental Psychology, they report that people 
who watched violent shows at age 8 were more aggressive in 
their 20s. Men who had been in the top 20% of violent TV 
watchers as children were twice as likely to push their wives 
around; women viewers were more likely to have thrown 
something at their husbands. The differences persisted, the 
researchers say, even when they controlled for children’s ini- 
tial aggression levels, IQs, social status, and bad parenting. 
Study co-author Leonard Eron thinks the association is now 
airtight, “just as much as smoking causes lung cancer.”

Some researchers agree. The study “shows more clearly 
than any other that TV is more than just an amplifying fac- 
tor: It alone can cause increases in aggression,” says Duke 
University biologist Peter Klopfer. But skeptics remain un- 
convinced. “We already know that exposure to media vio- 
lence is associated with aggressive behavior,” says 
biostatistician Richard Hockey of Mater Hospital in Bris- 
bane, Australia. And the most plausible explanation is still 
that “aggressive people like violent TV.” Hockey adds that 
if causation exists, the effect is modest: Correlations of 
childhood violent TV viewing with adult aggression hover 
around 0.2, which means that TV contributes just about 5% 
of the increase in aggressive behavior. 


Source: From “More on TV Violence,” Science, Vol. 299, 
March 21, 2003, p. 1839, by Constance Holden. Reprinted 
with permission of the American Association for the Advance- 
ment of Science. 
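
Hockey's "about 5%" figure in News Story 16 appears to rest on squaring the correlation: the square of a correlation is commonly read as the share of variability in one measurement that is accounted for by a linear relationship with the other. The Python sketch below shows that arithmetic. The function name `variance_explained` is hypothetical, and reading the quoted figure this way is an assumption made here, since the story does not show the calculation.

```python
def variance_explained(r):
    """Return the squared correlation, the share of variance accounted for."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r ** 2


r = 0.2  # approximate correlation quoted in the story
share = variance_explained(r)
print(f"r = {r}: about {100 * share:.0f}% of the variability is accounted for")
```

A correlation of 0.2 gives a squared value of 0.04, or about 4 percent, which is roughly in line with the "about 5%" quoted and leaves most of the variability in adult aggression associated with other factors.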


News Story 17 
Even when monkeying around, life isn’t fair
By Jamie Talan 


It turns out monkeys know when they get a raw deal— 
which, in this case, is a cucumber. 

In a new study, scientists report that for the first time, a 
species of nonhuman primate knows when it has been 
treated unfairly. 

Scientists at Yerkes National Primate Research Center 
of Emory University in Atlanta studied capuchin monkeys, 
known for their social system built on cooperation. 

The social groups consist of one dominant male and 
three to six adult females and their offspring. 

The study involved a simple exchange: Two female res- 
idents of a colony were each given a small token—a gran- 
ite pebble. 

When they returned it to the researcher, they received a 
slice of cucumber. Those exchanges were completed 95 
percent of the time. 

But when one monkey got a grape instead, the rate of 
completed exchanges fell to 60 percent as the other monkey 
often refused to accept the cucumber slice, took it and 
threw it away or handed it to the other capuchin. 

Finally, one monkey would get a grape without having 
to do any work—and the completion rate fell to 20 percent. 

When the other monkey recognized the disparity in ef- 
fort and reward, she sometimes would even refuse to hand 
over the pebble. 

The same findings were observed in each of five pairs. 
The study appears in Thursday’s issue of Nature. 

The researchers, Frans de Waal, an endowed professor 
of primate behavior and director of the Living Links Cen- 
ter at Yerkes, and Sarah Brosnan, a graduate student, didn’t 
find the same behavior in male monkeys. 

Brosnan said she suspects that’s because adult males 
don’t live together cooperatively. 

“These female monkeys don’t like it when someone gets 
a better deal,” Brosnan said. 

The grape-deprived monkeys didn’t show any emotional 
reaction toward their partners—but they did ignore the 
scientist. 

“They knew I was the source of the inequality,” she said. 

The researchers said their work is aimed at understand- 
ing the evolutionary development of social fairness. 

They’re repeating the study in chimps, a species more 
closely related to humans. 

Robert Frank, an endowed professor of economics at 
Johnson School of Management at Cornell University, said 
he had never seen an animal study “where someone turns
down an offer because it wasn’t good enough,” a trait he
said is common in human exchanges. 

“People often pass on an available reward because it is 
not what they expect or think is fair,” Brosnan said. “Such
irrational behavior has baffled scientists and economists, 
who traditionally have argued all economic decisions are 
rational. 

“Our findings in nonhuman primates indicate the emotional
sense of fairness plays a key role in such decision-making.”

Source: From “Even when monkeying around, life isn’t fair,” 
San Antonio Express-News, Sept. 21, 2003, p. 18A. Reproduced 
with permission of SAN ANTONIO EXPRESS-NEWS. 


News Story 18 
Heavier babies become smarter adults, study shows 
By Emma Ross 


In the biggest study to date examining the influence of birth 
weight on intelligence, scientists have found that babies born 
on the heavy side of normal tend to be brighter as adults. 

Experts have long known that premature or underweight 
babies tend to be less intelligent as children. 

But the study, published this week in the British Med- 
ical Journal, found that among children whose birth weight 
was higher than 5.5 pounds—considered to be normal—the 
bigger the baby, the smarter it was likely to be. 

Scientists think it has something to do with bigger ba- 
bies having bigger brains, or perhaps with having more 
connections within their brains. 

But the lead researcher on the project said there was no 
need for parents of smaller infants to despair—the results 
were averages and size at birth does not necessarily deter- 
mine intellectual destiny. 

“Birth weight is only one of numerous factors that in- 
fluence cognitive function. It may not actually be a very 
powerful one,” said Marcus Richards, a psychologist at 
Britain’s Medical Research Council who conducted the 
study. “Parental interest in education—being in the PTA 
and getting involved in your child’s homework—has an 
enormous impact, one that may even offset the effect of 
birth weight.” 

Similarly, Richards said, the head start enjoyed by hefty 
babies can be squandered. Living in an overcrowded home, 
breathing polluted air or being caught in the middle of a di- 
vorce tend to diminish children’s intelligence scores, he said. 

The scientists found that birth size influenced intelli- 
gence until about the age of 26. After that, it tended to even 
out, as other factors began to play a more important role.

The study did not offer concrete examples, such as how
many IQ points’ advantage a 10-pound baby might have 
over a 7-pound baby. 

And of course, there are always exceptions. 

The research involved 3,900 British men and women 
who were born in 1946 and followed since birth. Their in- 
telligence was measured by a battery of tests at the ages of 
8, 11, 15, 26 and 43. 

Increasing intelligence corresponded with increasing 
birth weight until the age of 26. By the age of 43, the effect 
was weaker. 

How brainy the children were at 8 seemed to be the most 
important influence on later intelligence, the study found. 

Heavier babies went on to achieve higher academic 
qualifications. That outcome was mostly linked to how 
brainy they were at age 8. 

“It seems birth weight does what it does by age 8 and
that that puts you on a path,” Richards said. 

But the effect seemed to have waned by the age of 43, 
by which time the smaller babies apparently caught up. 

The results were not affected by birth order, gender, fa- 
ther’s social class or mother’s education and age. Even af- 
ter the babies who were underweight were excluded, the 
link remained strong. 

“This is an important finding that shows how strong the 
link is. We’ve seen it in low birth-weight babies, but this 
shows that even if you are a normal weight baby, bigger is 
better, at least when it comes to intelligence,” said Dr.
Catherine Gale, who has conducted similar research at 
Southampton University in England. 

Experts don’t know exactly what makes a heavy baby, 
but Gale said well-built, well-nourished mothers tend to 
produce heavier babies. 

Mothers who eat badly, smoke and are heavy drinkers 
tend to produce smaller babies. However, experts don’t know 
whether those factors influence the relationship between 
birth weight and intelligence. There are probably several 
other variables that affect birth weight, but which of those are 
connected to intelligence is not known, Richards said. 


Source: From Emma Ross, “Heavier Babies become smarter 
adults, study shows,” Associated Press, January 26, 2001. 
Reprinted with permission of the Associated Press. 


News Story 19 
Young romance may lead to depression 
By Malcolm Ritter 


The most famous youthful romance in the English-speaking
world, that star-crossed love of Romeo and Juliet, was
a tragedy. Now researchers have published a huge study of
real-life adolescents in love.

It’s also no comedy. 

The results suggest that on balance, falling in love 
makes adolescents more depressed, and more prone to 
delinquency and alcohol abuse than they would have been 
if they’d avoided romance. 

The reported effect on depression is small, but it’s big- 
ger for girls than boys. The researchers suggest it could be 
one reason teen girls show higher rates of depression than 
teen boys do, a difference that persists into adulthood. 

This is not exactly the view of romance that prevails 
around Valentine’s Day. Researchers who’ve studied 
teen-age love say that smaller studies had shown teen ro- 
mance can cause emotional trouble, but that the new work 
overlooked some good things. 

The study was done by sociologists Kara Joyner of Cor- 
nell University and J. Richard Udry of the University of 
North Carolina at Chapel Hill. They presented the results in 
the December issue of the Journal of Health & Social 
Behavior. 

Their results are based on responses from about 8,200 
adolescents across the country who were interviewed twice, 
about a year apart, about a wide variety of things. The kids 
were ages 12 to 17 at the first interview. 

To measure levels of depression, the researchers exam- 
ined adolescents’ answers to 11 questions about the previ- 
ous week, such as how often they felt they couldn’t shake 
off the blues, felt lonely or sad or got bothered by things 
that normally wouldn’t faze them. 

To see what love’s got to do with it, the researchers 
compared responses from adolescents who didn’t report 
any romantic involvement at either interview with those 
who reported it at both interviews. They looked at how 
much depression levels changed between interviews for 
each group. 

The finding: The romantically involved adolescents 
showed a bigger increase in depression levels, or a smaller 
decrease, than uninvolved teens. 

The difference wasn’t much. For boys of all ages, it was 
about one-half point on a 33-point scale. Girls were hit 
harder, with a 2-point difference for girls who’d been 12 at 
the first interview, and diminishing with age to about a half- 
point difference for girls who’d been 17. 

The results were a surprise, because studies of adults 
have shown married people tend to be less depressed than 
single ones, Joyner said. So why would love lower adoles- 
cent mood? By analyzing the adolescents’ answers to other 
questions, Joyner and Udry found evidence for three possi- 
ble factors: deteriorating relationships with parents, poorer 
performance in school and breakups of relationships. 


In fact, it appeared that for boys, romance made a dif- 
ference in depression only if they’d had a breakup between 
interviews. For girls, in contrast, the biggest impact from 
romance seemed to come from a rockier relationship with 
Mom and Dad. That was especially so among younger 
girls, where the bump in depression was biggest. 

To Joyner, it makes sense that “if a young daughter is 
dating, her parents may be concerned about her choice of 
partner or what she is doing with him. Presumably, their 
concern leads to arguments. That would be my guess.” 

But it’s only a guess. The study can’t prove what caused 
what. Maybe girls feeling less loved at home were more 
likely to seek romance with a guy, rather than the other way 
around. 


Source: From “Young romance may lead to depression, study 
says,” by Malcolm Ritter, February 14, 2001. Associated Press. 
Reprinted with permission of the Associated Press. 


News Story 20 


Eating organic foods reduces pesticide 
concentrations in children 


Environmental Health Perspectives 


A study published today in the March issue of the peer- 
reviewed journal Environmental Health Perspectives (EHP) 
found that consuming organic produce and juice may lower 
children’s exposure to potentially damaging pesticides. 
The researchers compared levels of organophosphorus 
(OP) pesticide metabolites in 39 children aged 2-5 who 
consumed nearly all conventional produce and juice versus 
those who consumed nearly all organic produce and juice. 
The children eating primarily organic diets had signifi- 
cantly lower OP pesticide metabolite concentrations than 
did the children eating conventional diets. Concentrations 
of one OP metabolite group, dimethyl metabolites, were
approximately six times higher for the children eating conventional
diets. Studies suggest that chronic low-level exposure
to OP pesticides may affect neurologic functioning,
neurodevelopment, and growth in children. 

“The dose estimates suggest that consumption of organic 
fruits, vegetables, and juice can reduce children’s exposure 
levels from above to below the U.S. Environmental Protec- 
tion Agency’s current guidelines, thereby shifting exposures 
from a range of uncertain risk to a range of negligible risk,” 
the study authors wrote. “Consumption of organic produce 
appears to provide a relatively simple way for parents to re- 
duce their children’s exposure to OP pesticides.” 

In this study, families were recruited from both a retail 
chain grocery store selling primarily conventional foods and 
from a local consumer cooperative selling a large variety of 
organic foods. The parents kept a food diary for their chil- 
dren for three days. Urine collected from the children on day 
three was analyzed for pesticide metabolites. All of the fam- 
ilies lived in the Seattle, Washington, metropolitan area. 

“Organic foods have been growing in popularity over 
the last several years,” says Dr. Jim Burkhart, science edi- 
tor for EHP. “These scientists studied one potential area of 
difference from the use of organic foods, and the findings 
are compelling.” 

The study team included Cynthia L. Curl, Richard A. 
Fenske, and Kai Elgethun of the School of Public Health 
and Community Medicine at the University of Washington. 

EHP is the journal of the National Institute of Environ- 
mental Health Sciences, part of the U.S. Department of 
Health and Human Services. More information is available 
online at http://www.ehponline.org/ 


Source: From “Eating Organic Foods Reduces Pesticide Concen- 
trations in Children,” http://www.newfarm.org/news/030503/ 
pesticide_kids.shtml (The New Farm, News and Research, online; 
Copyright The Rodale Institute). Reprinted with permission of 
The Rodale Institute.


Solutions to Selected Exercises


Chapter 1 


3. 


13. 


15. 


17. 


If the measurements of interest are extremely vari- 
able, then a large sample will be required to detect 
real differences between groups or treatments. 


. Visit both numerous times because waiting times
vary, then compare the averages.

. a. No. It would be unethical to randomly assign people
to smoke cigars or not.

b. No. People who smoke cigars may also drink more
alcohol, eat differently, or have some other characteristic
that is the actual cause of the higher rate of
cancer.

. Randomized experiment

b. Yes, because of the randomization all other features,
such as amount of water and sun, should be
similar for the two groups of plants.

The sample is the few thousand asked; the population
is all adults in the nation.

a. Observational study

b. No. It could be that people who choose to meditate
tend to have lower blood pressure anyway for
other reasons. You cannot establish a causal connection
with an observational study.




Chapter 2 


3. 


12. 


15. 


a. Biased, the guards; unbiased, trained independent 
interviewers 


. Major and grade-point average 
. No. For example, college major 
. a. Version 2 


b. Version 1 

A statistical difference may not have any practical 
importance. 

Use Component 3; volunteer respondents versus na- 
tionwide random sample 


Chapter 3 


1. 


12. 
16. 


19. 
22. 


23. 


25. 


a. Gender (male/female) 

b. Time measured on a clock that is 5 minutes fast 

c. Weight of packages on a postal scale that weighs 
items 1 oz high half the time and 1 oz low half the 
time 

a. Deliberate bias and unnecessary complexity 

. Do you support or not support banning prayers in
schools?

. Measurement

. Categorical

. Discrete

Continuous

. Yes. Nominal variables are all categorical
variables.

It would be easier with only a little natural variability.

Although only one-fifth favored forbidding public
speeches (version A), almost one-half did not want to
allow them (version B). Americans appear to be reluctant
to forbid this type of free speech (four-fifths
didn’t want to do so), but they aren’t as reluctant to
withhold approval, with almost half willing to do so.

Anonymous 

a. No, there is so much overlap in the times that it 
would be difficult to make a definite conclusion 
about which one is really faster based on only five 
times for each route. 

b. Route 1 is always 14 minutes, and Route 2 is al- 
ways 16 minutes. 

Beer consumption measured in bottles (discrete) or
ounces (continuous)

a. Systolic blood pressure is likely to differ due to all 
three causes. People are different, each person’s 
blood pressure changes over time, and there is 
measurement error. 

c. Natural variability across individuals. Because
they are all measured at the same time, and the
measurement should be accurate, the other two
sources are not involved.

26.


29.


33.

a. Natural variability across time and measurement 
error are both likely to cause variability in blood 
pressure measurements for the same person across 
days. 

b. Blood type doesn’t change over time for a given 
individual and is measured accurately, so there 
would be no variability. 

Here is how the story explains it: “To measure levels
of depression, the researchers examined adolescents’
answers to 11 questions about the previous week,
such as how often they felt they couldn’t shake off the
blues, felt lonely or sad or got bothered by things that
normally wouldn’t faze them” (from News Story 19
in the Appendix).

a. Likely to be valid because it measures the 
severity of 13 of the most common hangover 
symptoms 


Chapter 4 


1. 


12. 


15. 


17. 
18. 
22. 
26. 


28. 


a. Cluster sampling 

b. Systematic sampling 

. Stratified sampling 

. Convenience or haphazard sampling 

7%

. 73% to 87% 

. Cluster sampling 

. Survey 

. Randomized experiment 

A low response rate refers to a survey given to a legitimate
random sample, but where many don’t respond.
A volunteer sample consists of those who respond to
a publicly announced survey. A volunteer sample is
worse because there is no way to ascertain whether or
not it represents any larger group.

b. Yes. The margin of error tells us that the interval 
from 52% to 60% most likely covers the truth. 

a. Case study 

No 

Not reaching the individuals selected 

c. Field poll is more likely to represent the 
population. The SFGate poll was based on a vol- 
unteer sample, not likely to represent any larger 
group. 

Used a convenience sample of students in introductory
psychology. They probably represent students at
colleges similar to that one who take introductory
psychology.




Chapter 5 


2. 


30. 


31. 


a. Single-blind because customers could read the me- 
ters and knew the plan; neither 

b. Single-blind because drivers knew what they had 
taken; block design 

c. Double-blind; block design 

a. The color of the cloth is a confounding variable. 
Maybe birds prefer red. 


. The placebo effect 
. Explanatory is form of exercise; response is weight
loss.


. Yes. Randomly assign treatments to the volunteers.
Generalizability may be a problem if volunteers don’t
represent the population.


. a. Ecological validity; experimenter effect 


b. Hawthorne effect 


. a. Explanatory variable is regular church attendance
or not, and response variable is age at death.


. a. The Davis study was case control but the Laden
study was not.

b. The Davis study was retrospective and the Laden 
study was prospective. 

b. A cause-and-effect conclusion is not justified be- 
cause spending money was not randomly assigned; 
this was an observational study. 

b. The teens were randomly selected from a larger 
population, so results can be extended. 


Chapter 7 


4. 
9. 


13. 


15. 
21. 


25. 


28. 


30. 


35. 


Sales prices of cars 

a. 65 

b. 54, 62.5, 65, 70, 78 

a. Skewed to the right, causing the mean to be much 
higher than the median 

b. Median because half of the users lost that much or 
more, half that much or less 

Incomes for musicians 

Range because it’s determined solely by the two
extreme values, the highest and lowest

Temperatures for the entire year, with one “cold”
mode and one “warm” mode

c. The mean would be higher because of the large 
outliers. 

Mean is 10.91, median is 4.5. Median is a better
representation.

a. Organic: 0, 0.2, 0.2, 0.4, 1.3; conventional: 0, 0.5, 
1.4, 3.4, 8.3 


Chapter 8 


2. a. 10% 
b. 60% 
1% 
2.05 
-0.67
8. a. z-score is about -0.41, so it’s the 34th percentile
10. z-score is 1.88, so cholesterol level is 208.8
12. a. Lower quartile is z = -0.67 and upper quartile is
z = 0.67

15. a. 68% in 84 to 116, 95% in 68 to 132, 99.7% in 52 
to 148 


18. 17.56 minutes because 90th percentile has a z-score 
of 1.28 
20. a. 0.75 or 75% 
b. No, that’s the average. Normal temperatures cover 
a wide range. 
24. The mean is much higher than the median, so they 
must be skewed or have large outliers. 
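
The percentile and z-score answers in this chapter can be checked numerically. The following is a minimal sketch using `statistics.NormalDist` from Python's standard library. The mean of 100 and standard deviation of 16 in the last step are inferred from the intervals given in answer 15a; they are an assumption, not something the answer key states.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Answer 8a: percentage of the population below a z-score of -0.41
percentile = std_normal.cdf(-0.41) * 100

# Answer 12a: z-scores that cut off the lowest 25% and the lowest 75%
lower_q = std_normal.inv_cdf(0.25)
upper_q = std_normal.inv_cdf(0.75)

# Answer 18: z-score of the 90th percentile
z_90 = std_normal.inv_cdf(0.90)

# Answer 15a: Empirical Rule intervals (1, 2, and 3 standard deviations),
# ASSUMING mean 100 and standard deviation 16 (inferred from the answer)
mean, sd = 100, 16
intervals = {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

print(round(percentile), round(lower_q, 2), round(upper_q, 2), round(z_90, 2))
print(intervals)
```

The rounded results agree with the table values used in the answers (34th percentile, quartiles at about -0.67 and 0.67, and 1.28 for the 90th percentile).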


Chapter 9 


1. a. Pie chart 
b. Histogram 
c. Bar graph 
d. Scatterplot 

5. The horizontal axis does not maintain a constant
scale.

8. b. Population has steadily increased, so numbers of 
violent crimes over the years would increase even 
if the rate stayed constant. 

12. In general, not starting at zero allows changes to be 
seen more readily, while starting at zero emphasizes 
the actual magnitude of the measurements. 

16. a. Bar graph 


Chapter 10 


1. Yes, about 5 of them 
3. a. y = 0.96 (x - 0.5) = -0.48 + 0.96x (in pints)
7. A correlation of -0.6 implies a stronger relationship.
11. a. Predicted ideal weight for a 150-lb woman is
133.9 lb.
12. a. Negative, because the slope is negative
b. Predicted time is 35.87, 1.45 seconds too high.
c. They decrease by 0.1142 second per year, or about
0.46 second per 4-year Olympics.
16. Yes. The strength of the relationship is the same as 
for the golfers and the sample is much larger. 
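
Two of the regression answers above rest on simple arithmetic that is easy to verify. The sketch below checks that the two forms of the equation in answer 3a agree, and converts the yearly slope in answer 12c to a change per 4-year Olympic cycle. The function names are hypothetical and exist only for this illustration.

```python
def pints_factored(x):
    """Answer 3a in factored form: y = 0.96 (x - 0.5)."""
    return 0.96 * (x - 0.5)


def pints_expanded(x):
    """Answer 3a in intercept-slope form: y = -0.48 + 0.96x."""
    return -0.48 + 0.96 * x


# The two forms agree (up to floating-point rounding) for any x.
for x in (0.0, 1.0, 2.5, 10.0):
    assert abs(pints_factored(x) - pints_expanded(x)) < 1e-12

# Answer 12c: a slope of -0.1142 second per year, over 4 years
slope_per_year = -0.1142
change_per_olympics = 4 * slope_per_year

print(round(change_per_olympics, 2))  # -0.46
```

The negative sign on the result reflects that winning times decrease; the answer reports the size of the decrease, 0.46 second.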


Chapter 11


2. No. People who prefer to exercise (walk) may also be 
people who keep their weight under control, eat less 
fat, and so on. 

5. Winning time in an Olympic running event and the 
cost of running shoes are likely to be negatively cor- 
related because prices go up and winning times go 
down over time. 

7. d. Combining groups inappropriately 

10. Salary and years of education in a company where the 
president makes $1 million a year but has an eighth- 
grade education 

14. No. There are confounding factors, such as more 
sports cars are red. 

17. No, there are too many other possible factors. For in- 
stance, if the first winter was very harsh and the sec- 
ond one mild, there would be more fatalities in the 
first due just to the weather. 

22. All of Reasons 1 to 6 are likely to apply. For instance,
for Reason 2, the response variable (unhealthful 
snacking) may be causing changes in the explanatory 
variable (stress) because of chemical effects of the 
poor diet. 


Chapter 12 


4. Relative risk = 10 
6. a. Retrospective observational study, probably case- 
control 

9. a. 13.9%, 0.106, 212/1525 or 0.139, 465 to 3912 

. Proportion or risk 

13. c. No. It could be that drug use would decline in that 
time period anyway, or drugs of choice would 
change. 

17. b. 76.1% of blacks were approved and 84.7% of
whites. Proportions are 0.761 and 0.847.

. Relative risk 

21. a. The increased risk of dying from circulatory dis- 
ease is reported to be 21%. 
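
The proportion, risk, and odds figures in answer 9a can be reproduced from counts. The sketch below is illustrative only: the pairing of 0.106 with the counts 465 and 3912 is an inference from the arithmetic (465 out of 465 + 3912), not something the answer key states, and the helper names are hypothetical.

```python
def risk(count_with, total):
    """Proportion (risk): number with the trait divided by the group total."""
    return count_with / total


def odds(count_with, count_without):
    """Odds, written as 'number with the trait to number without'."""
    return f"{count_with} to {count_without}"


# Answer 9a: 212 out of 1525, stated as 0.139 (13.9%)
risk_a = round(risk(212, 1525), 3)

# ASSUMED reading: 465 with the trait and 3912 without it
risk_b = round(risk(465, 465 + 3912), 3)
odds_b = odds(465, 3912)

print(risk_a, risk_b, odds_b)
```

Note the difference between the two measures: risk divides by the whole group, while odds compare those with the trait to those without it.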
Chapter 13


2. They are equivalent; the probability is usually set 
at .05. 
5. They are expected if the null hypothesis is true. 
8. a. No, 1.42 < 3.84 
c. Yes, 0.01 < 0.05 
11. a. Null hypothesis: There is no relationship between 
sex and preferred candidate for the population of 
voters. 
d. 10.1 
15. a. 0.28 or 28% for males; 0.23 or 23% for females
b. 8.64; statistically significant
c. No. The chi-square statistic would be about 0.86.
19. a. First null hypothesis: For the population of college
students who drink there is no relationship between
gender and experiencing vomiting as a
hangover symptom.
21. b. Expected count = (466)(895)/1215
c. Chi-square statistic = 5.35
24. b. Expected count for Male, Yes is 27.
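
The expected-count formula in answer 21b and the comparison with 3.84 in answer 8a can be sketched as follows. This is a minimal illustration, assuming 2-by-2 tables (1 degree of freedom) and a 0.05 cutoff; the function names are hypothetical. The cutoff is computed from the standard normal distribution, using the fact that a chi-square variable with 1 degree of freedom is a squared standard normal variable.

```python
from statistics import NormalDist


def expected_count(row_total, column_total, grand_total):
    """Expected cell count if there is no relationship (null hypothesis true)."""
    return row_total * column_total / grand_total


# 0.05 cutoff for 1 degree of freedom: square of the two-sided z cutoff (about 1.96)
CUTOFF = NormalDist().inv_cdf(0.975) ** 2


def is_significant(chi_square, cutoff=CUTOFF):
    """Statistically significant when the statistic exceeds the cutoff."""
    return chi_square > cutoff


# Answer 21b: (466)(895)/1215
count_21b = expected_count(466, 895, 1215)

print(round(count_21b, 2), round(CUTOFF, 2))
print(is_significant(1.42), is_significant(8.64), is_significant(0.86))
```

The three checks on the last line correspond to answers 8a (not significant), 15b (significant), and 15c (not significant).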


Chapter 14 


2. a. (Partial answer) There was a 72% increase from 
1940 to 1950, calculated as (24.1 - 14.0)/14.0 =
0.72, or 72%. 

3. 233.33 for 1981 and 466 for 1995; the cost of a pa- 
perback novel in 1995 was 466% of what it was in 
1968. 

7. $56.07 

12. What matters to consumers is how prices are chang- 
ing, not how current prices compare with the base 
year prices. 

17. “Average duration of unemployment, in weeks, is lagging.”
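
The percentage increase in answer 2a and the reading of an index number in answer 3 can be sketched in a few lines. This is an illustration only, with hypothetical function names; the underlying prices for answer 3 are not given in the answer key, so only the index value itself is used.

```python
def percent_change(old, new):
    """Change from old to new, as a percentage of the old value."""
    return (new - old) / old * 100


def index_as_multiple(index_value):
    """Convert an index number (base year = 100) to a multiple of the base value."""
    return index_value / 100


# Answer 2a: increase from 14.0 in 1940 to 24.1 in 1950
increase = percent_change(14.0, 24.1)

# Answer 3: an index of 466 means 466% of (4.66 times) the base-year price
multiple = index_as_multiple(466)

print(round(increase), multiple)
```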


Chapter 15 


1. a. Positive 
b. Nonexistent (ignoring possible global warming) 
4. It would be better to adjust for inflation to have a fair 
comparison over the years. 
8. You would expect a positive correlation. 


10. a. Seasonal and cycles 

b. Trend and seasonal 

c. Possible seasonal and cycles 
15. a. 1994 salary would be $119,646. 

b. 1994 salary would be $487,356. 
Chapter 16 


2. a. 9%, or 0.09 
6. The probability that the earth will be uninhabitable by 
the year 2050 
9. a. 0.10, or 10% 
b. (.10)(.10), or .01; assume payments (or not) are in- 
dependent for the two customers. 
14. a. Observing the relative frequency 
17. a. Not likely to be independent; they are likely to 
share political preferences. 


24. 3.3; you would not expect this each time, but you 
would expect it in the long run. (Four years of college 
may not be enough of a long run to reach this ex- 
actly.) 

26. Discrete data, where the expected value is also the 
mode; for instance, suppose a small box of raisins 
contains 19 pieces with probability .1, 20 with probability
.8, and 21 with probability .1. Then the expected
value is 20 raisins, which is also the most likely 
value. 
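
The expected value in answer 26 and the multiplication rule in answer 9b can be verified directly. The sketch below uses exact fractions to avoid floating-point rounding; the probabilities for 19, 20, and 21 raisins are taken as .1, .8, and .1, which is the only reading under which they sum to 1.

```python
from fractions import Fraction

# Answer 26: number of raisins in a box and the probability of each count
distribution = {
    19: Fraction(1, 10),
    20: Fraction(8, 10),
    21: Fraction(1, 10),
}

# A valid probability distribution must sum to 1.
assert sum(distribution.values()) == 1

# Expected value: each outcome weighted by its probability
expected_value = sum(count * prob for count, prob in distribution.items())

# Mode: the single most likely outcome
mode = max(distribution, key=distribution.get)

# Answer 9b: probability the same event happens for both customers,
# assuming the two customers are independent
p_both = Fraction(1, 10) * Fraction(1, 10)

print(expected_value, mode, float(p_both))  # 20 20 0.01
```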


Chapter 17 


2. Anchoring 
5. A detailed scenario may increase your personal prob- 
ability of it happening because of availability. An ex- 
ample is someone trying to sell you a car alarm 
describing in detail the methods thieves use. 
9. Statement A has a higher probability, but people 
would think statement B did. 
14. The rate of the disease for people similar to you 
16. Optimism 


Chapter 18 


2. See Exercise 17 for an example. 
6. There are numerous meanings to the word pattern, 
and it is easy to find something that fits one. 
9. Yes, there are limited choices and people know their 
friends’ calling behaviors. 
12. a. 9,900/10,700 = 0.925 
b. 9,900/99,000 = 0.10 
c. 10,700/100,000 = 0.107 
17. No. Parts don’t wear out independently of each other. 
20. a. Choice A, because for choice B the expected value 
is $9.00, less than the $10 gift. 
22. b. If the odds were 1 to 3 in favor of the other team, 
you would have an expected value of zero. 


Chapter 19 


4. b. Virtually impossible; standardized score over 10 
7. Example: take a random sample of 1000 students at a 
large university and find the proportion who have a 
job during the school year. 
11. a. Bell-shaped, mean = 0.20, standard deviation = 
0.008 
b. No, the standardized score for 0.17 is -3.75.
13. b. No, the standardized score for a mean of 7.1 is 
about 7.24. 


14. a. No, the expected number not meeting standards is 
only 3. 
b. Yes, conditions are met. 
17. Bell-shaped, with a mean of 3.1 and standard devia- 
tion of 0.05 
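
The standardized score in answer 11b follows from the mean and standard deviation given in answer 11a. A minimal sketch is shown below; the function name is hypothetical, and the "within 3 standard deviations" check is an assumed rule of thumb based on the Empirical Rule, not a criterion stated in the answer key.

```python
def standardized_score(observed, mean, standard_deviation):
    """Number of standard deviations the observed value lies from the mean."""
    return (observed - mean) / standard_deviation


# Answer 11a: possible sample proportions are bell-shaped,
# with mean 0.20 and standard deviation 0.008
z = standardized_score(0.17, mean=0.20, standard_deviation=0.008)

# ASSUMED rule of thumb: values more than 3 standard deviations
# from the mean are very unlikely
is_plausible = abs(z) <= 3

print(round(z, 2), is_plausible)  # -3.75 False
```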


Chapter 20 


2. a. 0.16 to 0.28 
5. a. 68% 
b. 90% 
7. No, the volunteer (self-selected) sample is not 
representative. 
15. a. 20% to 24% 
b. 0.02 ± 0.002, or 1.8% to 2.2%
16. 25.6% to 48.4%; it almost covers chance (25%). 
20. c. 49% ± 2%, or 47% to 51%


Chapter 21 
3. a. SEM is 1/12 hour, or 5 minutes. 
6. a. The population means are probably different. 
b. The population means could be the same. 
8. a. 29 ± (1.645)(16.3), or 2.2 to 55.8
b. Yes; the interval does not cover zero. 
11. a. Confidence interval is 64.64 ± 0.81, or 63.83 to
65.45.
16. 12% to 23%; there is a change because 0 is not in the 
interval. 


17. Method 1; the difference within couples is desired, 
not the difference across sexes. 
21. c. Multiply by 60 minutes to get 2.5 to 20.6 
minutes. 
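
The interval in answer 8a has the form "estimate plus or minus multiplier times standard error." The sketch below reproduces it, assuming a normal approximation; the multiplier 1.645 corresponds to a 90% confidence level. The function name is hypothetical.

```python
from statistics import NormalDist


def confidence_interval(estimate, standard_error, level):
    """Estimate ± multiplier × standard error, using the normal approximation."""
    multiplier = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for 90%
    margin = multiplier * standard_error
    return estimate - margin, estimate + margin


# Answer 8a: 29 ± (1.645)(16.3)
low, high = confidence_interval(29, 16.3, level=0.90)

# Answer 8b: a difference is indicated when the interval does not cover zero
covers_zero = low <= 0 <= high

# Answer 3a: a standard error of the mean of 1/12 hour, in minutes
sem_minutes = 60 / 12

print(round(low, 1), round(high, 1), covers_zero, sem_minutes)
```

The same function with `level=0.95` would use a multiplier of about 1.96, giving a wider interval from the same data.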


Chapter 22 


3. b. Type 1 error 
5. a. Null: Psychotherapy and desipramine are equally 
effective in treating cocaine use. Alternative: One 
method is more effective than the other. 
c. Type 1; concluding there is a difference when 
there isn’t 

10. A type 1 error would be that no ESP is present but the
study concludes it is being used.

12. c. There is a link between vertex baldness and heart 
attacks in the population, but the evidence in the 
sample wasn’t strong enough to conclusively de- 
tect it. 

16. a. Minor disease with serious treatment—for exam- 
ple, tonsillitis and the treatment is surgery 

b. Being infected with HIV 


17. They should use a higher cutoff for the p-value to
reduce the probability of a type 2 error.
20. d. 0.05


Chapter 23 


2. One-sided
4. a. No
b. Yes, the p-value would be .04.
7. a. Null: The training program has no effect on test
scores.
b. 2.5
c. Alternative; scores would be higher after the program.
10. b. Reject the null hypothesis. Conclude that elderly
people with pets have significantly fewer doctor
contacts than those without pets.
15. b. No. The data would not have supported that alternative
because in the sample the dieters lost more
fat.
19. a. There would be no difference in antibody response
to a flu vaccine for meditators and nonmeditators
in the population.


Chapter 24 


1 


10. 


12. 
17. 
21. 


. a. Yes 

b. No, the magnitude wasn’t much different, but the 
large sample sizes led to a statistically significant 
difference. 

a. 4.911 

a. A confidence interval for the relative risk of heart 
attack during heavy versus minimal physical exer- 
tion is from 2.0 to 6.0. 

The small sample sizes would result in low power, or
high probability of a type 2 error even if ESP were
present.

1 out of 20, so 1 overall. 

A little variability 

a. No. The difference was not statistically significant. 


Chapter 25 


3. 




14. 


a. They would like to make causal conclusions. 

b. Yes 

. b. Statistical significance versus practical importance 
. 1000 researchers; 50 could easily be contacted. 

. One reason is that some studies used outdated tech- 
nology. 

Vote counting could not detect a relationship, but 
meta-analysis could, by increasing the overall power. 
Chapter 26


4. a. Good idea 
7. No, the intervals overlap and there is no clear choice. 
One method produces a slightly higher mean but also 
has more variability. 
10. This condition was not met. The 9-year-old girl was
in immediate proximity to the practitioners, and she
knew which hand she was hovering over with her
hand.
17. They chose the confidence level to present the outcome
they desired. They should have reported the
standard 95% confidence interval.
19. The complaint raised by the letter is that the study
was not measuring what it purported to measure.